🚀 S3 Just Killed the Vector Database: How Amazon S3 Vectors Changes Everything for AI Data Storage 💾

What if I told you that you could run vector searches directly on S3 without spinning up a single database or compute cluster? For years, we’ve been stuck with a painful pipeline: extract data from S3, chunk it, generate embeddings, load everything into OpenSearch or Pinecone, and manage all that infrastructure. Amazon just changed the game with S3 Vectors – it’s S3 that can do vector math natively, no compute engine required. This means up to 90% cost savings and zero infrastructure management. Let me show you exactly how this works and why it might replace your vector database entirely. ...

August 10, 2025 · 7 min · 1458 words · Vesko Vujovic

💡 Spark Caching: When It Helps and When It Hurts Your Performance 🔧

Ever had a Spark job that keeps re-reading the same data over and over? You might need caching. But cache at the wrong time, and you’ll actually slow things down. Here’s when caching helps, when it doesn’t, and how to use it right. This blog will be short but sweet! 🔍 What is Caching? Think of caching like keeping your frequently used files on your desk instead of walking to the filing cabinet every time you need them. ...

July 20, 2025 · 5 min · 928 words · Vesko Vujovic

🦾 Picture Perfect Match: Building an Image Similarity Search Engine with Vector Databases 🤖

Have you ever wondered how Pinterest finds visually similar images or how Google Photos recognizes faces across thousands of pictures? The technology that powers these features isn’t magic; it’s vector similarity search. Today, modern vector databases make it possible for developers to build these powerful visual search capabilities without needing a PhD in computer vision. In this post, I’ll guide you through the process of building your own image similarity search engine. We’ll cover everything from understanding vector embeddings to implementing a working solution that can find visually similar images in milliseconds. ...

May 15, 2025 · 9 min · 1729 words · Vesko Vujovic

📊 The Analytics Self-Service Revolution: How Data Catalogs Empower Enterprise Teams 💡

Picture this: Your marketing team needs customer data for an upcoming campaign. You submit a request to IT, initiating a complex process that involves multiple teams, approval chains, and coordination across departments. The request joins a backlog of similar requests, each requiring data team resources. This scenario plays out daily in enterprises worldwide. What should be simple data requests turn into lengthy processes that can stretch for weeks or months. ...

April 24, 2025 · 13 min · 2588 words · Vesko Vujovic

🏗️ Why Data Warehouses Backed by Open Table Formats Could Completely Replace Traditional DWHs 🌊

The data warehouse landscape is experiencing a tectonic shift. After decades of dominance by traditional vendor-locked solutions, a new way of thinking is emerging: data warehouses built on open table formats. This architectural approach isn’t just another incremental improvement; it represents a fundamental reimagining of how organizations store, manage, and analyze their critical data assets. Open table formats like Apache Iceberg, Delta Lake, and Apache Hudi are transforming what’s possible in data warehousing. By decoupling storage from compute and leveraging cloud-native technologies, these formats enable data architectures that are more flexible, cost-effective, and powerful than their traditional counterparts. ...

April 5, 2025 · 13 min · 2584 words · Vesko Vujovic

🚨 The Hidden Pitfall That Sabotages SQL Performance: Functions on Indexed Columns 📉

As data engineers and analysts, we rely heavily on SQL databases to store and query our data efficiently. To speed up our queries, we often create indexes on frequently filtered columns. However, there’s a common gotcha that can cause our queries to run slower than expected, even with appropriate indexes in place. In this post, we’ll explore how applying functions to indexed columns in the WHERE clause can prevent SQL optimizers from utilizing those indexes effectively. ...

February 5, 2025 · 6 min · 1250 words · Vesko Vujovic

Speed Up Your Spark Jobs: The Hidden Trap in Union Operations

The Problem: Union isn’t as simple as it seems. Picture this: you have a large dataset that you need to process in different ways, so you split it into smaller pieces, transform each piece differently, and put them back together using union. Sounds straightforward, right? Well, there’s a catch that most developers don’t know about. The Hidden Performance Killer 🐌: Here’s what’s actually happening behind the scenes when you use union: ...

November 29, 2024 · 4 min · 844 words · Vesko Vujovic

AWS Lambda Event Source Mapping: The Magic Behind Kafka Offset Management

When building event-driven architectures with AWS Lambda and Apache Kafka, one of the most critical yet often misunderstood components is offset management, especially when you use event source mapping with Lambda functions. Many developers wonder: “Do I need to manage Kafka offsets manually?” or “What happens when my consumer group’s offsets expire?” In this blog post, we’ll demystify how AWS Lambda’s Event Source Mapping handles Kafka offsets automatically and what you actually need to know as a developer. ...

November 16, 2024 · 5 min · 1002 words · Vesko Vujovic

DuckDB Inside Postgres: The Unlikely Duo Supercharging Analytics

If you work in data engineering, you know that the field moves at a dizzying pace. New tools and technologies seem to pop up daily, each promising to revolutionize how we store, process, and analyze data. Amidst this constant change, two names have remained stalwarts: Postgres, the tried-and-true relational database, and DuckDB, the talented new kid on the block for analytics workloads. ...

October 30, 2024 · 9 min · 1730 words · Vesko Vujovic

Apache Spark: Beware of Column Ordering and Data Types When Using Apache Spark's Union Function

In this blog post, we’ll zoom in on how column ordering and data types can cause issues when using the union function in Apache Spark to combine two DataFrames. We’ll explore real-world examples that illustrate the problem and provide practical solutions to overcome these challenges. By the end of this post, you’ll have a better understanding of how to use union effectively and avoid common pitfalls that can lead to job failures. ...

October 6, 2024 · 5 min · 1038 words · Vesko Vujovic