9 posts tagged with "Spark"

Eviction formula for Spark Structured Streaming watermark

April 26, 2026 · 5 min read

Data Engineer

I thought watermarking was a trivial concept until I encountered cross-stream joins and out-of-order data. Handling unexpected event-time skew and late-arriving data across multiple streams takes more than a basic configuration, a point the documentation often overlooks. This post is a technical deep dive into the lessons I learned while debugging state expiration and late-arrival logic when developing and deploying complex streaming pipelines at work.
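To make the idea concrete, here is a rough sketch of the arithmetic, assuming the commonly documented rule that a stream's watermark trails the maximum event time seen so far by the configured delay threshold, and that a query with several watermarked inputs follows the slowest one under the default `min` policy. The function names and timestamps are made up for illustration, and the sketch ignores the fact that Spark only advances the watermark between micro-batches.

```python
from datetime import datetime, timedelta

def stream_watermark(max_event_time, delay):
    # Watermark of one stream: max event time seen so far minus the delay threshold.
    return max_event_time - delay

def global_watermark(per_stream_watermarks):
    # Assumption: the "min" policy, so the slowest stream holds everything back.
    return min(per_stream_watermarks)

# Hypothetical inputs: the left stream is ahead, the right stream is lagging.
left = stream_watermark(datetime(2026, 1, 1, 10, 5, 0), timedelta(minutes=2))
right = stream_watermark(datetime(2026, 1, 1, 10, 1, 0), timedelta(minutes=1))
combined = global_watermark([left, right])

# State whose event time falls below `combined` becomes eligible for eviction.
print(left.time(), right.time(), combined.time())
```

Because the lagging stream determines the combined value, join state can stay around much longer than either delay threshold alone suggests.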

Spark Structured Streaming ordered write

March 22, 2026 · 8 min read

Lam Tran

Data Engineer

Distributed messaging systems like Kafka are built for throughput and fault tolerance, not for global ordering. A Kafka topic splits data across multiple partitions. Each partition maintains its own internal order, but there is no ordering guarantee across partitions. When Spark reads from multiple partitions in parallel, records reach the executors in arrival order, not event-time order. A transaction timestamped 10:00:03 sitting in a lagging partition will arrive after a transaction timestamped 10:00:47 from a faster one. From Spark's perspective, the later event came first.
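The scenario above can be simulated in a few lines of plain Python. This is only a sketch: the partition names and the extra timestamps are hypothetical, and no Kafka or Spark API is involved.

```python
# Hypothetical records as (partition, event_time) pairs, listed in arrival order.
arrival_order = [
    ("p1", "10:00:47"),  # the faster partition is read first
    ("p0", "10:00:03"),  # the lagging partition delivers an older event later
    ("p1", "10:00:52"),
    ("p0", "10:00:50"),
]

def is_sorted(times):
    # True when every timestamp is <= the one that follows it.
    return all(a <= b for a, b in zip(times, times[1:]))

def times_for(partition):
    # Event times of a single partition, in the order they arrived.
    return [t for p, t in arrival_order if p == partition]

per_partition_ok = all(is_sorted(times_for(p)) for p in ("p0", "p1"))
globally_ok = is_sorted([t for _, t in arrival_order])

# Restoring event order requires an explicit sort on event time.
event_order = sorted(arrival_order, key=lambda record: record[1])

print(per_partition_ok, globally_ok)  # prints: True False
```

Each partition is ordered on its own, yet the combined arrival sequence is not, which is exactly the gap a streaming job has to close before writing.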

This breaks any application where output correctness depends on sequence:

Transaction listings — rows must be displayed in the order they occurred. Out-of-order writes mean a user sees their payment history shuffled.
Running balance calculations — each row's balance is derived from all prior rows. A single late-arriving event invalidates every balance computed after it.
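A small sketch of the running-balance problem, using made-up amounts and plain Python rather than Spark. It recomputes balances in event-time order before and after a late event shows up.

```python
from itertools import accumulate

# Hypothetical transactions as (event_time, amount), listed in arrival order.
# The last one is the late event: it happened first but arrived last.
arrived = [("10:00:47", -30), ("10:00:52", 20), ("10:00:03", 100)]

def running_balances(transactions):
    # Sort by event time, then accumulate amounts into a running balance.
    ordered = sorted(transactions, key=lambda txn: txn[0])
    totals = accumulate(amount for _, amount in ordered)
    return [(ts, balance) for (ts, _), balance in zip(ordered, totals)]

before_late = running_balances(arrived[:2])  # what was written before the late event
after_late = running_balances(arrived)       # what the output should have been

print(before_late)
print(after_late)
```

Every balance written before the 10:00:03 event arrived differs from the corrected one, so a sink that only appends rows cannot repair the output afterwards.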

This post covers how to tackle this in Spark Structured Streaming using watermarking, stateful operations, and controlled write semantics.

Differences between Spark RDD, Dataframe and Dataset

April 17, 2024 · 7 min read

Lam Tran

Data Engineer

I have taken part in a few technical interviews and talked with people about data engineering and the work they have done in the past. Most of them are familiar with Apache Spark, one of the most widely adopted frameworks for big data processing. What I have been asked, and what I often ask in return, is about the basic concepts of RDD, DataFrame, and Dataset and the differences between them. It sounds quite fundamental, right? Not really. If we take a closer look, there are lots of interesting details that can help us understand them and choose the one best suited to our project.

How Is Memory Managed In Spark?

July 7, 2023 · 8 min read

Lam Tran

Data Engineer

Spark is an in-memory data processing framework that can quickly process very large data sets and distribute the work across multiple machines. Because Spark applications are memory-heavy, memory management plays a very important role in the whole system.

Authorize Spark 3 SQL With Apache Ranger Part 2 - Integrate Spark SQL With Ranger

May 1, 2023 · 6 min read

Lam Tran

Data Engineer

In the previous post, I installed a standalone Ranger service. In this article, I show how we can customize the logical plan phase of the Spark Catalyst optimizer in order to achieve authorization in Spark SQL with Ranger.

Authorize Spark 3 SQL With Apache Ranger Part 1 - Ranger installation

April 30, 2023 · 5 min read

Lam Tran

Data Engineer

Spark and Ranger are widely used by many enterprises because of their powerful features. Spark is an in-memory data processing framework, and Ranger is a framework to enable, monitor, and manage comprehensive data security across the Hadoop platform. Ranger can therefore be used to handle authorization for Spark SQL, and this series walks you through the integration of the two frameworks. This is the first part, where we install Ranger on our machine, along with Apache Solr for auditing.

Spark Catalyst Optimizer And Spark Session Extension

January 7, 2023 · 15 min read

Lam Tran

Data Engineer

The Spark Catalyst optimizer sits at the core of Spark SQL. Its purpose is to optimize structured queries expressed in SQL or through the DataFrame/Dataset APIs, minimizing application running time and cost. People using Spark often treat the Catalyst optimizer as a black box, assuming it works mysteriously without really caring what happens inside it. In this article, I go in depth into its logic, its components, and how the Spark session extension can be used to change Catalyst's plans.

Create A Data Streaming Pipeline With Spark Streaming, Kafka And Docker

September 11, 2022 · 9 min read

Lam Tran

Data Engineer

Hi guys, I'm back after a long time without writing anything. Today, I want to share how to create a Spark Streaming pipeline that consumes data from Kafka, with everything built on Docker.

Create A Standalone Spark Cluster With Docker

January 1, 2022 · 7 min read

Lam Tran

Data Engineer

Lately, I've spent a lot of time teaching myself how to build Hadoop clusters, Spark, Hive integration, and more. This article covers how to build a Spark cluster for data processing using Docker, with 1 master node and 2 worker nodes, running in standalone mode. (In upcoming articles I may cover a Hadoop cluster with YARN as the integrated resource manager.) Let's get into it.