Stragglers are detrimental to the overall performance of Spark applications and lead to resource wastage on the underlying cluster. Therefore, it is important to identify potential stragglers in your Spark Job, identify the root cause behind them, and apply the required fixes or preventive measures.
What is a Straggler in a Spark Application?
A straggler refers to a very slow executing Task belonging to a particular stage of a Spark application (every stage in Spark is composed of one or more Tasks, each one computing a single partition out of the total partitions designated for the stage). A straggler Task takes an exceptionally long time to complete as compared to the median or average time taken by other Tasks belonging to the same stage. …
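As a hedged illustration (the data, key distribution, and local master setting below are hypothetical), a heavily skewed key in a shuffle-based aggregation is one typical way a straggler shows up: the single task that computes the partition holding the hot key runs far longer than its peers in the same stage.

import org.apache.spark.sql.SparkSession

object StragglerIllustration {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StragglerIllustration")
      .master("local[4]")   // hypothetical local run, just for illustration
      .getOrCreate()
    import spark.implicits._

    // Hypothetical skewed data: roughly 10% of all records map to key 0, while the
    // remaining records spread evenly over ~1000 other keys. After the shuffle for
    // groupBy, the task computing key 0's partition becomes the straggler.
    val skewed = spark.range(0L, 10000000L).as[Long]
      .map(i => (if (i % 10 == 0) 0L else i % 1000, 1L))
      .toDF("key", "value")

    skewed.groupBy("key").count().collect()

    // Compare median vs. max task duration for the shuffle stage in the
    // Spark UI's Stages tab to spot the straggler task.
    spark.stop()
  }
}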
Let us assume there is a very large Dataset ‘A’ with the following schema:
root
 |-- empId: Integer
 |-- sal: Integer
 |-- name: String
 |-- address: String
 |-- dept: Integer
The Dataset ‘A’ needs to be filtered against a set of employee IDs (empIds), ‘B’ (which is small enough to be broadcast to the executors), to get a filtered Dataset ‘A`’. The filter operation can be represented as:
A` = A.filter(A.empId contains in 'B')
To achieve this very common filtering scenario, you can use four types of transformations in Spark, each with its own pros and cons. Below is a description of how each of these four transformations can be used to execute this particular filtering scenario, along with detailed notes on the reliability and efficiency aspects of each. …
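The four approaches themselves are covered in the full story; purely as a hedged sketch (the input path, ID values, and helper names below are hypothetical, based on the schema above), here are two commonly used ways to express this filter in Spark:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, udf}

object FilterByEmpIds {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("FilterByEmpIds").getOrCreate()
    import spark.implicits._

    // 'A' stands for the large employee Dataset with the schema shown above
    // (the path is hypothetical); 'B' is the small set of employee IDs.
    val A = spark.read.parquet("/data/employees")
    val empIds: Set[Int] = Set(101, 205, 309)   // illustrative values

    // Option 1: broadcast the ID set to the executors and filter with a UDF
    // that checks membership in the broadcast set.
    val bEmpIds = spark.sparkContext.broadcast(empIds)
    val containsEmpId = udf((id: Int) => bEmpIds.value.contains(id))
    val aFiltered1 = A.filter(containsEmpId($"empId"))

    // Option 2: express 'B' as a small Dataset and use a broadcast left-semi join,
    // which keeps only those records of 'A' whose empId matches one in 'B'.
    val B = empIds.toSeq.toDF("empId")
    val aFiltered2 = A.join(broadcast(B), Seq("empId"), "left_semi")

    aFiltered1.show()
    aFiltered2.show()
    spark.stop()
  }
}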
Join operations are often used in a typical data analytics flow in order to correlate two data sets. Apache Spark, being a unified analytics engine, has also provided a solid foundation to execute a wide variety of Join scenarios.
At a very high level, a Join operates on two input data sets: it works by matching each data record belonging to one input data set against every data record belonging to the other input data set. On finding a match or a non-match (as per a given condition), the Join operation can output either an individual record from one of the two data sets or a joined record composed from both. …
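As a minimal sketch (the data sets and column names below are hypothetical), the two output styles described above correspond to different join types in Spark: an inner join emits joined records on a match, while left-semi and left-anti joins emit individual records from one side on a match or non-match respectively.

import org.apache.spark.sql.SparkSession

object JoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("JoinSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical input data sets.
    val orders    = Seq((1, "laptop"), (2, "phone"), (3, "desk")).toDF("custId", "item")
    val customers = Seq((1, "Asha"), (2, "Ravi")).toDF("custId", "name")

    // On a match, emit a joined record composed from both data sets.
    orders.join(customers, Seq("custId"), "inner").show()

    // On a match / non-match, emit an individual record from one data set only.
    orders.join(customers, Seq("custId"), "left_semi").show()  // matched orders only
    orders.join(customers, Seq("custId"), "left_anti").show()  // unmatched orders only

    spark.stop()
  }
}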
I first encountered Apache Spark around 4 years ago, and since then, I have been architecting Spark applications meant for executing complex data processing flows on multiple massive data sets.
During all these years of architecting numerous Spark Jobs and working with a large team of Spark developers, I noticed that, in general, a comprehensive understanding of the various aspects of Spark partitioning is lacking among Spark users. Because of this, they miss out on the massive optimization opportunities that exist for building reliable, efficient, and scalable Spark Jobs meant for processing large data sets.
Therefore, based on our experience, knowledge, and research, my colleague Naushad and I decided to write a book focusing on just this one important aspect of Apache Spark, i.e., partitioning. The book’s title, “Guide to Spark Partitioning”, is also aligned with this single objective of the book. …
Shuffle operations are the backbone of almost all Spark Jobs that are aimed at data aggregation, joins, or data restructuring. During a shuffle operation, the data is shuffled across various nodes of the cluster via a two-step process:
a) Shuffle Write: Shuffle map tasks write the data to be shuffled to a disk file; the data is arranged in the file according to the shuffle reduce tasks. The chunk of shuffle data written by a shuffle map task for a particular shuffle reduce task is called a shuffle block. …
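As a hedged sketch (the data and partition counts below are hypothetical), any wide transformation such as reduceByKey triggers this two-step process: the map-side tasks write shuffle blocks to local disk, and the reduce-side tasks fetch and merge the blocks addressed to them.

import org.apache.spark.sql.SparkSession

object ShuffleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ShuffleSketch").master("local[4]").getOrCreate()
    val sc = spark.sparkContext

    // reduceByKey is a wide transformation and therefore needs a shuffle:
    // the 4 map-side tasks each write their shuffle data to a local disk file,
    // arranged as one shuffle block per reduce task; the 2 reduce-side tasks
    // then fetch and merge the blocks addressed to them.
    val words = sc.parallelize(Seq("spark", "shuffle", "spark", "stage", "shuffle"), 4)
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _, 2)

    counts.collect().foreach(println)
    // Shuffle write/read sizes per stage are visible in the Spark UI's Stages tab.
    spark.stop()
  }
}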
While running a Spark application on a cluster, the driver container, running the application master, is the first one to be launched by the cluster resource manager. The application master, after initializing its components, launches the primary driver thread in the same container. The driver thread runs the main method of the Spark application. The first thing the main method does is initialize the Spark context, which in turn hosts the key components of the driver responsible for driving and supervising the cluster execution of the underlying Spark application. …
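As a minimal sketch (the application name and structure are hypothetical), this is roughly what that driver-side main method looks like, with the Spark context initialization as its first step:

import org.apache.spark.sql.SparkSession

object MySparkApp {
  // The primary driver thread launched by the application master runs this main method.
  def main(args: Array[String]): Unit = {
    // First step: initialize the Spark context (wrapped by SparkSession here),
    // which brings up the driver-side components (DAG scheduler, task scheduler,
    // scheduler backend, etc.) that drive and supervise cluster execution.
    val spark = SparkSession.builder()
      .appName("MySparkApp")
      .getOrCreate()
    val sc = spark.sparkContext

    // ... application logic: read inputs, transform, and run actions ...

    spark.stop()
  }
}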
The Aggregation operator is heavily used across Spark applications meant for data mining and analytics. Therefore, Spark provides both a wide variety of ready-made aggregation functions and a framework to build custom aggregation functions. These aggregation functions can be used on Datasets in a variety of ways to derive aggregated results.
With the Custom Aggregation framework, a user can implement a specific aggregation flow for aggregating a set of records. For Custom Aggregation, prior releases of Spark have provided two approaches: the first is based on ‘UserDefinedAggregateFunction’ and the second is based on ‘Aggregator’.
The two approaches are explained in detail in a previous story, titled “UDAF and Aggregators: Custom Aggregation Approaches for Datasets in Apache Spark”. Of the two, ‘UserDefinedAggregateFunction’, also called UDAF, was introduced first to support custom aggregation on Dataframes, the untyped view of data in Spark. Later, with the introduction of Dataset, which supports both typed and untyped views of data, the ‘Aggregator’ approach was additionally introduced to support custom aggregation on the typed view of data. …
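As a hedged sketch of the ‘Aggregator’ approach (the Employee record, the average-salary logic, and the names below are hypothetical), a custom aggregation is defined by extending Aggregator[IN, BUF, OUT] and can then be used as a typed column on a Dataset:

import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical typed record and a custom aggregation computing the average salary.
case class Employee(empId: Int, sal: Int, dept: Int)
case class AvgBuffer(var sum: Long, var count: Long)

object AvgSalary extends Aggregator[Employee, AvgBuffer, Double] {
  def zero: AvgBuffer = AvgBuffer(0L, 0L)
  def reduce(b: AvgBuffer, e: Employee): AvgBuffer = { b.sum += e.sal; b.count += 1; b }
  def merge(b1: AvgBuffer, b2: AvgBuffer): AvgBuffer = { b1.sum += b2.sum; b1.count += b2.count; b1 }
  def finish(b: AvgBuffer): Double = if (b.count == 0) 0.0 else b.sum.toDouble / b.count
  def bufferEncoder: Encoder[AvgBuffer] = Encoders.product[AvgBuffer]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

object AggregatorSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("AggregatorSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    val employees = Seq(Employee(1, 1000, 10), Employee(2, 3000, 10)).toDS()
    // Typed usage on a Dataset: the custom Aggregator is turned into a TypedColumn.
    employees.select(AvgSalary.toColumn.name("avg_sal")).show()

    spark.stop()
  }
}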
The majority of Spark applications source input data for their execution pipeline from a set of data files (in various formats). To facilitate the reading of data from files, Spark provides dedicated APIs in the context of both raw RDDs and Datasets. These APIs abstract the reading process from data files into an input RDD or a Dataset with a definite number of partitions. Users can then perform various transformations/actions on these input RDDs/Datasets.
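As a minimal sketch (the file paths below are hypothetical), both flavors of these file-reading APIs yield a partitioned input, and the resulting partition count can be inspected directly:

import org.apache.spark.sql.SparkSession

object FileReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("FileReadSketch").getOrCreate()
    val sc = spark.sparkContext

    // Raw RDD API: minPartitions is a hint for the number of input partitions.
    val linesRdd = sc.textFile("/data/events.log", minPartitions = 8)
    println(s"RDD partitions: ${linesRdd.getNumPartitions}")

    // Dataset API: the file is abstracted into a Dataset with a definite number
    // of partitions decided by the data source and related configs
    // (e.g., spark.sql.files.maxPartitionBytes).
    val eventsDs = spark.read.json("/data/events.json")
    println(s"Dataset partitions: ${eventsDs.rdd.getNumPartitions}")

    spark.stop()
  }
}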
You can also read my recently published book, “Guide to Spark Partitioning”, which deep dives into all aspects of Spark partitioning, with multiple examples to explain each aspect in detail: https://www.amazon.com/dp/B08KJCT3XN/ …
Data in Spark always remains partitioned: right after reading from a data source, during intermediate transformation(s), and up to the point when an action is performed to produce the desired output. The partitioned data at each stage is represented by a low-level abstraction called RDD. Programmers can directly use RDDs to write Spark applications. However, a higher-level abstraction (built on top of RDDs), called Dataset, is also optionally available to users for writing Spark applications.
The Spark execution pipeline is likewise built around data partitions. A typical pipeline includes reading one or more partitions of an input RDD, computing intermediate partition(s) of intermediate RDD(s), and finally applying an action on the computed partition(s) of the desired RDD. …
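As a hedged sketch (the data and partition counts below are hypothetical), a simple pipeline makes this partition-centric flow visible: the partition count carries through narrow transformations, changes across a shuffle, and the action finally materializes the computed partitions.

import org.apache.spark.sql.SparkSession

object PartitionPipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PartitionPipelineSketch").master("local[4]").getOrCreate()
    val sc = spark.sparkContext

    // Input RDD: the data is partitioned from the moment it is read/created.
    val input = sc.parallelize(1 to 1000, 8)
    println(s"input partitions: ${input.getNumPartitions}")        // 8

    // Narrow transformation: each output partition is computed from one input partition.
    val squared = input.map(x => x * x)
    println(s"after map: ${squared.getNumPartitions}")             // still 8

    // Wide transformation: intermediate partitions are recomputed via a shuffle.
    val grouped = squared.map(x => (x % 4, x)).reduceByKey(_ + _, 4)
    println(s"after reduceByKey: ${grouped.getNumPartitions}")     // 4

    // Action: the computed partitions of the final RDD are brought to the driver.
    grouped.collect().foreach(println)
    spark.stop()
  }
}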
A Spark stage can be understood as a compute block that computes data partitions of a distributed collection and that can execute in parallel on a cluster of computing nodes. Spark builds the parallel execution flow of a Spark application using one or more stages. Stages provide modularity, reliability, and resiliency to Spark application execution. Below are the various important aspects related to Spark Stages:
Stages are created, executed, and monitored by the DAG scheduler: Every running Spark application has a DAG scheduler instance associated with it. This scheduler creates stages in response to the submission of a Job, where a Job essentially represents an RDD execution plan (also called an RDD DAG) corresponding to an action taken in a Spark application. Multiple Jobs can be submitted to the DAG scheduler if multiple actions are taken in a Spark application. For each Job submitted to it, the DAG scheduler creates one or more stages, builds a stage DAG to lay out the stage dependency graph, and then plans an execution schedule for the created stages in accordance with the stage DAG. The scheduler also monitors the status of stage execution, which could turn out to be success, partial success, or failure. …
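As a minimal sketch (the data below is hypothetical), a single action on an RDD lineage containing one wide transformation submits one Job whose RDD DAG is split into two stages, a shuffle map stage followed by a result stage; a second action submits another Job.

import org.apache.spark.sql.SparkSession

object StageDagSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("StageDagSketch").master("local[4]").getOrCreate()
    val sc = spark.sparkContext

    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), 4)
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    // The action below submits one Job to the DAG scheduler; the shuffle introduced
    // by reduceByKey splits the RDD DAG into two stages (a shuffle map stage followed
    // by a result stage), whose dependency graph and status can be inspected in the
    // Spark UI's Jobs/Stages tabs.
    counts.collect().foreach(println)

    // A second action in the same application submits another Job.
    println(counts.count())

    spark.stop()
  }
}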