Stragglers in your Spark Application affect the overall application performance and waste premium resources.

Stragglers are detrimental to the overall performance of Spark applications and lead to resource wastage on the underlying cluster. Therefore, it is important to identify potential stragglers in your Spark Job, determine the root cause behind them, and apply fixes or take preventive measures.

What is a Straggler in a Spark Application?

A straggler refers to an unusually slow-running Task belonging to a particular stage of a Spark application. (Every stage in Spark is composed of one or more Tasks, each one computing a single partition out of the total partitions designated for the stage.) A…
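One common first line of defense against stragglers is Spark's built-in speculative execution, which re-launches suspiciously slow tasks on other executors. A minimal sketch of enabling it (the configuration keys are standard Spark properties; the threshold values here are illustrative, not a recommendation):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: enable speculative execution so that tasks running much slower
// than their peers in the same stage are re-launched on another executor.
val spark = SparkSession.builder()
  .appName("straggler-mitigation")
  // Turn on speculative re-launch of slow tasks.
  .config("spark.speculation", "true")
  // A task is a speculation candidate if it is 1.5x slower than the stage median...
  .config("spark.speculation.multiplier", "1.5")
  // ...and only once 75% of the stage's tasks have already finished.
  .config("spark.speculation.quantile", "0.75")
  .getOrCreate()
```

Speculation helps with environment-induced stragglers (a slow node or disk); it does not fix data-skew stragglers, since the re-launched copy processes the same oversized partition.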



Filtering a Spark Dataset against a collection of data values is commonly encountered in many data analytics flows. This story explains four different ways to achieve it.

Let us assume there is a very large Dataset ‘A’ with the following schema:

|-- empId: Integer
|-- sal: Integer
|-- name: String
|-- address: String
|-- dept: Integer

The Dataset ‘A’ needs to be filtered against a set of employee IDs (empIds), ‘B’ (small enough to be broadcast to the executors), to get a filtered Dataset ‘A`’. The filter operation can be represented as:

A` = A.filter(A.empId contains in 'B')

To achieve this most common filtering scenario, you can use four types of transformations in Spark, each one having its own pros and cons. Here is…
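As a taste of the approaches compared in the story, here is a hedged sketch of three common ways to express this filter. The names follow the example above; the sample data is made up, and a SparkSession `spark` is assumed:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, col}

val spark = SparkSession.builder().appName("filter-demo").getOrCreate()
import spark.implicits._

// Hypothetical data standing in for the large Dataset 'A' and the ID set 'B'.
val A = Seq((1, 100, "x", "addr1", 10), (2, 200, "y", "addr2", 20))
  .toDF("empId", "sal", "name", "address", "dept")
val B = Seq(1, 3)

// 1) isin: simplest; works when 'B' is small, since the whole list is
//    embedded into the query plan.
val f1 = A.filter(col("empId").isin(B: _*))

// 2) Broadcast left-semi join: scales better when 'B' is held as a DataFrame.
val bDf = B.toDF("empId")
val f2 = A.join(broadcast(bDf), Seq("empId"), "left_semi")

// 3) Broadcast variable + a typed filter function evaluated per record.
val bSet = spark.sparkContext.broadcast(B.toSet)
val f3 = A.filter(r => bSet.value.contains(r.getAs[Int]("empId")))
```

The trade-offs (plan size, broadcast cost, whether predicate pushdown still applies) differ per approach, which is what makes the comparison worthwhile.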



This story is exclusively dedicated to the Join operation in Apache Spark, giving you an overall perspective of the foundation on which Spark's Join technology is built.

Join operations are often used in a typical data analytics flow to correlate two data sets. Apache Spark, being a unified analytics engine, provides a solid foundation for executing a wide variety of Join scenarios.

At a very high level, a Join operates on two input data sets: the operation works by matching each data record belonging to one input data set against the data records belonging to the other. On finding a match or a non-match (as per a given condition), the Join operation could either output an…
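The match/non-match behavior described above is what distinguishes the join types. A small sketch (made-up data; assumes a SparkSession `spark` with `spark.implicits._` imported):

```scala
// Hypothetical data sets to correlate on the 'dept' column.
val employees = Seq((1, "alice", 10), (2, "bob", 20)).toDF("empId", "name", "dept")
val depts = Seq((10, "sales"), (30, "hr")).toDF("dept", "deptName")

// Inner join: only records with a matching 'dept' on both sides are output.
val matched = employees.join(depts, Seq("dept"), "inner")

// Left outer join: non-matching left-side records are also output,
// with nulls filled in for the right side's columns.
val withNonMatches = employees.join(depts, Seq("dept"), "left_outer")
```

The same matching condition, combined with different rules for emitting non-matches, yields the full family of inner/outer/semi/anti joins.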

Partitioning is one of the basic building blocks on which the Apache Spark framework has been built. Just by setting the right partitioning across the various stages, many Spark programs can be optimized right away.
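As an illustration of "setting the right partitioning across stages", here is a minimal sketch (file paths are hypothetical; a SparkSession `spark` is assumed):

```scala
// Read: the initial partition count is derived from the input file splits.
val raw = spark.read.textFile("data.txt")
println(raw.rdd.getNumPartitions)

// Repartition before a wide, expensive stage to raise its parallelism...
val widened = raw.repartition(200)

// ...and coalesce before writing to avoid producing a flood of tiny files.
widened.coalesce(8).write.text("out")
```

`repartition` triggers a full shuffle, while `coalesce` merges existing partitions without one; picking the right tool at each stage is exactly the optimization opportunity the story describes.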

I first encountered Apache Spark about four years ago, and since then I have been architecting Spark applications meant for executing complex data processing flows on multiple massively sized data sets.

During all these years of architecting numerous Spark Jobs and working with a big team of Spark developers, I noticed that Spark users generally lack a comprehensive understanding of the various aspects of Spark partitioning. Because of this, they miss out on the massive optimization opportunities that exist for building reliable, efficient, and scalable Spark Jobs meant for processing larger data sets.

Therefore, based on my experience, knowledge…


Most Spark developers spend considerable time troubleshooting the Fetch Failed Exceptions observed during shuffle operations. This story covers the most common causes of a Fetch Failed Exception and reveals the results of a recent poll conducted on the exception.

Shuffle operations are the backbone of almost all Spark Jobs that are aimed at data aggregation, joins, or data restructuring. During a shuffle operation, the data is shuffled across various nodes of the cluster via a two-step process:

a) Shuffle Write: Shuffle map tasks write the data to be shuffled to a disk file, arranging the data in the file according to the shuffle reduce tasks that will consume it. The bunch of shuffle data corresponding to a particular shuffle reduce task, written by a shuffle map task, is called a shuffle block. …
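To make the two-step process concrete, here is a sketch of a transformation that triggers it (assumes a SparkSession `spark`; the partition count shown is Spark's documented default):

```scala
// reduceByKey triggers a shuffle: the map tasks perform the shuffle write,
// grouping their output records by the reduce task that will consume them.
val pairs = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val counts = pairs.reduceByKey(_ + _)   // stage boundary: shuffle write, then shuffle read

// For the Dataset/SQL API, the number of shuffle reduce tasks (and hence the
// number of shuffle blocks each map task writes) is governed by this setting:
spark.conf.set("spark.sql.shuffle.partitions", "200")
```

Each map task thus writes one shuffle block per reduce task; a Fetch Failed Exception arises on the read side, when a reduce task cannot fetch one of those blocks.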



The Spark Driver hosted for a Spark application is solely responsible for driving and supervising the parallel execution of the latter on a cluster of computing resources. This story focuses on the key components that enable the Spark Driver to perform its duties.

While running a Spark application on a cluster, the driver container, which runs the application master, is the first one to be launched by the cluster resource manager. The application master, after initializing its components, launches the primary driver thread in the same container. The driver thread runs the main method of the Spark application. The first thing the main method does is initialize the Spark context, which in turn hosts the key components of the driver responsible for driving and supervising the cluster execution of the underlying Spark application. …
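The driver-side sequence above maps onto the familiar skeleton of a Spark application's entry point (a sketch; the app name is arbitrary):

```scala
import org.apache.spark.sql.SparkSession

object Main {
  def main(args: Array[String]): Unit = {
    // Run by the primary driver thread. Building the SparkSession initializes
    // the Spark context underneath, which hosts the driver's key components
    // (DAG scheduler, task scheduler, scheduler backend, ...).
    val spark = SparkSession.builder().appName("driver-demo").getOrCreate()
    try {
      // ... application logic, driven and supervised from this driver thread ...
    } finally {
      spark.stop()  // tear down the driver's components and release resources
    }
  }
}
```

Everything between `getOrCreate()` and `stop()` executes under the supervision of those driver components.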



With the recent official announcement of Spark 3.0, Aggregator now becomes the default mechanism for performing custom aggregation on Datasets, as Spark 3.0 addresses the key usability and coexistence concerns in the earlier custom aggregation mechanisms. Read the story to know the details.

The aggregation operator is heavily used across Spark applications meant for data mining and analytics. Therefore, Spark has provided both a wide variety of ready-made aggregation functions and a framework to build custom aggregation functions. These aggregation functions can be used on Datasets in a variety of ways to derive aggregated results.

With the Custom Aggregation framework, a user can implement a specific aggregation flow for aggregating a set of records. For Custom Aggregation, prior releases of Spark have provided two approaches: the first based on ‘UserDefinedAggregateFunction’ and the second based on ‘Aggregator’.

The two approaches are explained in detail…
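To give a flavor of the Aggregator approach, here is a hedged sketch of a typed Aggregator that averages integer salaries (the name `AvgSal` and the salary domain are made up for illustration):

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Sketch: a typed Aggregator whose buffer carries a running (sum, count).
object AvgSal extends Aggregator[Int, (Long, Long), Double] {
  // Initial buffer value for an empty group.
  def zero: (Long, Long) = (0L, 0L)
  // Fold one input salary into the buffer.
  def reduce(b: (Long, Long), sal: Int): (Long, Long) = (b._1 + sal, b._2 + 1)
  // Merge two partial buffers (e.g. from different partitions).
  def merge(a: (Long, Long), b: (Long, Long)): (Long, Long) =
    (a._1 + b._1, a._2 + b._2)
  // Produce the final result from the buffer.
  def finish(b: (Long, Long)): Double =
    if (b._2 == 0) 0.0 else b._1.toDouble / b._2
  def bufferEncoder: Encoder[(Long, Long)] =
    Encoders.tuple(Encoders.scalaLong, Encoders.scalaLong)
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}
```

In Spark 3.0 such an Aggregator can also be registered for untyped/SQL use via `functions.udaf`, which is central to the coexistence improvement the story discusses.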



Continuing an earlier story on determining the number of partitions at critical transformations, this story describes the reasoning behind the number of partitions created from the data file(s) in a Spark application.

The majority of Spark applications source the input data for their execution pipeline from a set of data files (in various formats). To facilitate reading data from files, Spark has provided dedicated APIs in the context of both raw RDDs and Datasets. These APIs abstract the reading process from data files into an input RDD or a Dataset with a definite number of partitions. Users can then perform various transformations/actions on these input RDDs/Datasets.
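Both families of file-reading APIs can be seen side by side in a short sketch (the file name is hypothetical; a SparkSession `spark` is assumed):

```scala
// Dataset API: reads the file(s) into a partitioned Dataset.
val ds = spark.read.option("header", "true").csv("employees.csv")

// Raw RDD API: reads the same file(s) into a partitioned RDD of lines.
val rdd = spark.sparkContext.textFile("employees.csv")

// Inspect how many partitions each reading process produced; the counts are
// derived from file sizes, split sizes, and the relevant configuration.
println(ds.rdd.getNumPartitions)
println(rdd.getNumPartitions)
```

Reasoning about where those two numbers come from is precisely the subject of the story.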

You can also read my recently published book, “Guide to Spark Partitioning”, which deep dives into all aspects of Spark Partitioning with multiple examples…



The number of partitions plays a critical role in the execution of Spark applications. This story, in two parts, serves as a guide to reasoning about the number of partitions contained in an RDD or Dataset.

Data in Spark always remains partitioned: right after reading from a data source, during intermediate transformation(s), and up to the point when an action is performed to produce the desired output. The partitioned data at each stage is represented by a low-level abstraction called an RDD. Programmers can directly use RDDs to write Spark applications. However, a higher-level abstraction (built on top of RDDs), called a Dataset, is also available to users for writing Spark applications.

The Spark execution pipeline is also built around data partitions. A typical pipeline includes reading one or more partitions of an input RDD, computing intermediate…
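A small sketch of how the partition count can be observed, and how it changes across the pipeline (made-up data; assumes a SparkSession `spark`):

```scala
// Partition count at the source: requested explicitly here via numSlices.
val input = spark.sparkContext.parallelize(1 to 1000, 8)
println(input.getNumPartitions)   // 8, as requested at creation

// Partition count after a wide transformation: set by the transformation
// itself (here, reduceByKey with an explicit count of 4).
val shuffled = input.map(x => (x % 10, x)).reduceByKey(_ + _, 4)
println(shuffled.getNumPartitions)   // 4

// Narrow transformations preserve the partition count of their parent.
println(shuffled.map(identity).getNumPartitions)   // still 4
```

Tracing the count through narrow and wide transformations like this is the core exercise the two-part guide walks through.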



A stage in Spark represents a logical unit of parallel computation. Many such stages assembled together build the execution skeleton of a Spark application. This story tries to unravel the concept of a Spark stage and describes the important related aspects.

A Spark stage can be understood as a compute block that computes data partitions of a distributed collection, with the compute block able to execute in parallel on a cluster of computing nodes. Spark builds the parallel execution flow of a Spark application out of one or more such stages. Stages provide modularity, reliability, and resiliency to Spark application execution. Below are the various important aspects related to Spark stages:

Stages are created, executed, and monitored by the DAG scheduler: Every running Spark application has a DAG scheduler instance associated with it. This scheduler creates stages in response to the submission of a Job, where…
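A classic example of how the DAG scheduler carves a job into stages is word count (a sketch; the input path is hypothetical, and a SparkSession `spark` is assumed):

```scala
// Stage 1: narrow transformations, pipelined together by the DAG scheduler.
val words = spark.sparkContext.textFile("input.txt")
  .flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))

// reduceByKey requires a shuffle, so the DAG scheduler places a stage
// boundary here: stage 1 performs the shuffle write, stage 2 the read.
val counts = words.reduceByKey(_ + _)

// The action submits the Job; only now does the DAG scheduler create and
// execute the two stages. The breakdown is visible in the Spark UI.
counts.collect()
```

Narrow transformations are fused into one stage, while each shuffle dependency introduces a new one, which is exactly the stage-creation behavior described above.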

Ajay Gupta

Big Data Architect, Apache Spark Specialist,
