🚨 The Hidden Pitfall That Sabotages SQL Performance: Functions on Indexed Columns 📉

Introduction As data engineers and analysts, we rely heavily on SQL databases to store and query our data efficiently. To speed up our queries, we often create indexes on frequently filtered columns. However, there’s a common gotcha that can cause our queries to run slower than expected, even with appropriate indexes in place. In this post, we’ll explore how applying functions to indexed columns in the WHERE clause can prevent SQL optimizers from utilizing those indexes effectively....

February 5, 2025 · 6 min · 1250 words · Vesko Vujovic

Speed Up Your Spark Jobs: The Hidden Trap in Union Operations

The Problem: Union function isn’t as Simple as it Seems Picture this: You have a large dataset that you need to process in different ways, so you split it into smaller pieces, transform each piece differently, and put them back together using union. Sounds straightforward, right? Well, there’s a catch that most developers don’t know about. The Hidden Performance Killer 🐌 Here’s what’s actually happening behind the scenes when you use union:...

November 29, 2024 · 4 min · 844 words · Vesko Vujovic

AWS Lambda Event Source Mapping: The Magic Behind Kafka Offset Management

Introduction When building event-driven architectures with AWS Lambda and Apache Kafka, one of the most critical yet often misunderstood components is offset management, especially for event source mapping when you use Lambda functions. Many developers wonder: "Do I need to manage Kafka offsets manually?" or "What happens when my consumer group’s offsets expire?" In this blog post, we’ll demystify how AWS Lambda’s Event Source Mapping handles Kafka offsets automatically and what you actually need to know as a developer....

November 16, 2024 · 5 min · 1002 words · Vesko Vujovic

DuckDB Inside Postgres: The Unlikely Duo Supercharging Analytics

If you work in data engineering, you know that the field moves at a dizzying pace. New tools and technologies seem to pop up daily, each promising to revolutionize how we store, process, and analyze data. Amidst this constant change, two names have remained stalwarts: Postgres, the tried-and-true relational database, and DuckDB, the talented new kid on the block for analytics workloads....

October 30, 2024 · 9 min · 1730 words · Vesko Vujovic

Apache Spark: Beware of Column Ordering and Data Types When Using Apache Spark's Union Function

Introduction In this blog post, we’ll zoom into the details of how column ordering and data types can cause issues when using the union function in Apache Spark to combine two dataframes. We’ll explore real-world examples that illustrate the problem and provide practical solutions to overcome these challenges. By the end of this post, you’ll have a better understanding of how to use union effectively and avoid common pitfalls that can lead to job failures....

October 6, 2024 · 5 min · 1038 words · Vesko Vujovic

Apache Spark: Why JSON isn't an ideal format for your Spark job

Introduction Hi there 👋! In this blog post, we will explore why JSON is not suitable as a big data file format. We’ll compare it to the widely used Parquet format and dig deep to demonstrate, through examples, how the JSON format can significantly degrade the performance of your data processing jobs. JSON (JavaScript Object Notation) is a popular and versatile data format, but it has limitations when dealing with large-scale data operations....

September 9, 2024 · 5 min · 924 words · Vesko Vujovic

AWS: Lambda Event Source Mapping with Confluent Kafka

Introduction Welcome, readers! 📖 In this post, we’ll explore the idea of event source mapping in AWS, with a focus on its implementation and functionality. We’ll zoom in 🔍 on how automatic scaling works and examine the process of consuming messages from Kafka event sources. Using Lambda to consume records from Kafka Processing streaming data with traditional server-based technologies and Kafka consumers written in Scala can often introduce unnecessary overhead for simple tasks like creating custom sink consumers to save or delete data based on specific rules....

August 4, 2024 · 5 min · 981 words · Vesko Vujovic

Apache Spark: Dataset vs Dataframe - The Tortoise and Hare

Introduction Hi, welcome to my first blog post. This post is the first in a series of many that will follow. Who is the Tortoise and who is the Hare? Well, in the many books about Apache Spark that I have read, I didn’t find a clear picture of how dataframes perform compared to datasets. In this blog post, we will unravel that mystery and show some concrete results and insights regarding this matter....

July 21, 2024 · 5 min · 976 words · Vesko Vujovic