Apache Spark: Why JSON isn't an ideal format for your Spark job

Introduction: Hi there 👋! In this blog post, we will explore why JSON is not suitable as a big data file format. We’ll compare it to the widely used Parquet format and dig deep to demonstrate, through examples, how the JSON format can significantly degrade the performance of your data processing jobs. JSON (JavaScript Object Notation) is a popular and versatile data format, but it has limitations when dealing with large-scale data operations. On the other hand, Parquet, an open-source columnar storage format, has become the go-to choice for big data applications. ...

September 9, 2024 · 5 min · 924 words · Vesko Vujovic

AWS: Lambda Event Source Mapping with Confluent Kafka

Introduction: Welcome, readers! 📖 In this post, we’ll explore the idea of event source mapping in AWS, with a focus on its implementation and functionality. We’ll zoom in 🔍 on how automatic scaling works and examine the process of consuming messages from Kafka event sources. Using Lambda to consume records from Kafka: Processing streaming data with traditional server-based technologies and Kafka consumers written in Scala can often introduce unnecessary overhead for simple tasks like creating custom sink consumers to save or delete data based on specific rules. In such scenarios, where complex data processing, joining multiple topics, or a full-fledged streaming application isn’t required, a Python-based Lambda function emerges as a viable alternative. ...

August 4, 2024 · 5 min · 981 words · Vesko Vujovic

Apache Spark: Dataset vs Dataframe - The Tortoise and Hare

Introduction: Hi, welcome to my first blog post. This post is the first in a series of many that will follow. Who is the Tortoise and who is the Hare? Well, in the many books about Apache Spark that I have read, I didn’t find a clear picture of how DataFrames perform compared to Datasets. In this blog post, we will solve that mystery and show some concrete results and insights regarding this matter. ...

July 21, 2024 · 5 min · 976 words · Vesko Vujovic