
Stragglers in your Spark Application affect the overall application performance and waste premium resources.

Stragglers are detrimental to the overall performance of Spark applications and waste resources on the underlying cluster. Therefore, it is important to identify potential stragglers in your Spark Job, identify the root cause behind them, and apply the required fixes or preventive measures.

What is a Straggler in a Spark Application?

A straggler refers to a very slow executing Task belonging to a particular stage of a Spark application (every stage in Spark is composed of one or more Tasks, each one computing a single partition out of the total partitions designated for the stage). A straggler Task takes an exceptionally long time to complete as compared to the median or average time taken by the other tasks belonging to the same stage. …



Hands-on Tutorials, GUIDE TO SPARK EXECUTION

Filtering a Spark Dataset against a collection of data values is commonly encountered in many data analytics flows. This story explains four different ways to achieve it.

Let us assume there is a very large Dataset ‘A’ with the following schema:

root
 |-- empId: Integer
 |-- sal: Integer
 |-- name: String
 |-- address: String
 |-- dept: Integer

The Dataset ‘A’ needs to be filtered against a set of employee IDs (empIds), ‘B’ (which can be broadcast to the executors), to get a filtered Dataset ‘A`’. The filter operation can be represented as:

A` = A.filter(A.empId is in 'B')

To achieve this very common filtering scenario, you can use four types of transformations in Spark, each one having its own pros and cons. Here is a description of how each of these four transformations can be used to execute this filtering scenario, along with detailed notes on the reliability and efficiency aspects of each. …
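As a quick flavor of what such a filter looks like in practice, here is a minimal sketch of one common approach, using isin() on the empId column. The data source path and the contents of ‘B’ below are illustrative assumptions; the story itself compares four alternatives in detail.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("FilterAgainstCollection").getOrCreate()

// Hypothetical inputs: Dataset 'A' read from an assumed location, and 'B' as a small
// local collection of empIds to retain.
val A = spark.read.parquet("/path/to/datasetA")
val B: Seq[Int] = Seq(101, 205, 309)

// One common approach: isin() on the empId column (works well when 'B' is small).
val Aprime = A.filter(col("empId").isin(B: _*))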



GUIDE TO SPARK EXECUTION

This story is exclusively dedicated to the Join operation in Apache Spark, giving you an overall perspective of the foundation on which Spark's Join support is built.

Join operations are often used in a typical data analytics flow in order to correlate two data sets. Apache Spark, being a unified analytics engine, has also provided a solid foundation to execute a wide variety of Join scenarios.

At a very high level, Join operates on two input data sets by matching each record of one input data set against the records of the other. On finding a match or a non-match (as per a given condition), the Join operation can output either an individual record from one of the two data sets or a joined record. …
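To complement the preview, here is a minimal sketch of a join in Spark's Dataset API. The sample data, column names, and join types below are illustrative assumptions, not taken from the story.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("JoinSketch").getOrCreate()
import spark.implicits._

// Illustrative inputs: employees and departments correlated on the 'dept' key.
val empDf  = Seq((1, "Alice", 10), (2, "Bob", 20)).toDF("empId", "name", "dept")
val deptDf = Seq((10, "Engineering"), (30, "Sales")).toDF("dept", "deptName")

// An inner join emits only joined records for matching keys.
val joined = empDf.join(deptDf, Seq("dept"), "inner")
joined.show()

// An outer variant also emits the non-matching records from one side.
val withNonMatches = empDf.join(deptDf, Seq("dept"), "left_outer")
withNonMatches.show()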



Partitioning is one of the basic building blocks on which the Apache Spark framework is built. Just by setting the right partitioning across the various stages, many Spark programs can be optimized right away.
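As a small illustration of such tuning, here is a minimal sketch of explicitly controlling the partitioning before a wide transformation. The sample data, column name, and partition count are assumptions for illustration only, not recommendations.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("PartitioningSketch").getOrCreate()
import spark.implicits._

val df = Seq((1, 10), (2, 10), (3, 20)).toDF("empId", "dept")
println(df.rdd.getNumPartitions)              // partitioning inherited from the source

val byDept = df.repartition(8, col("dept"))   // explicit hash partitioning on 'dept'
println(byDept.rdd.getNumPartitions)          // now 8

val perDept = byDept.groupBy("dept").count()  // the subsequent wide transformation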

I encountered Apache Spark around 4 years back, and since then, I have been architecting Spark applications meant for executing complex data processing flows on multiple massively sized data sets.

During all these years of architecting numerous Spark Jobs and working with a big team of Spark developers, I noticed that a comprehensive understanding of the various aspects of Spark partitioning is generally lacking among Spark users. Because of this, they miss out on the massive optimization opportunities that exist for building reliable, efficient, and scalable Spark Jobs meant for processing larger data sets.

Therefore, based on our experience, knowledge, and research, my colleague Naushad and I decided to write a book focusing on just this one important aspect of Apache Spark, i.e., partitioning. The book’s title, “Guide to Spark Partitioning”, is also aligned with this single objective of the book. …



GUIDE TO SPARK EXECUTION

Most Spark developers spend considerable time troubleshooting the Fetch Failed Exceptions observed during shuffle operations. This story covers the most common causes of a Fetch Failed Exception and reveals the results of a recent poll conducted on the exception.

Shuffle operations are the backbone of almost all Spark Jobs that are aimed at data aggregation, joins, or data restructuring. During a shuffle operation, the data is shuffled across various nodes of the cluster via a two-step process:

a) Shuffle Write: Shuffle map tasks write the data to be shuffled to a disk file, with the data arranged in the file according to the shuffle reduce tasks. The chunk of shuffle data written by a shuffle map task for a particular shuffle reduce task is called a shuffle block. …
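As a complement to the preview above, here is a minimal sketch of a shuffle-triggering job, together with two shuffle-related configurations that are commonly revisited when fetch failures appear. The configuration values and sample data are purely illustrative assumptions, not recommendations.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ShuffleHeavyJob")
  .config("spark.shuffle.io.maxRetries", "10")   // retries per failing shuffle block fetch
  .config("spark.shuffle.io.retryWait", "30s")   // wait between fetch retries
  .getOrCreate()
import spark.implicits._

// groupBy forces a shuffle: map tasks perform the shuffle write, and reduce tasks
// then fetch the shuffle blocks written for them (the shuffle read).
val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")
df.groupBy("key").sum("value").show()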



GUIDE TO SPARK EXECUTION

The Spark Driver hosted for a Spark application is solely responsible for driving and supervising the parallel execution of the latter on a cluster of computing resources. This story focuses on the key components that enable the Spark Driver to perform its duties.

While running a Spark application on a cluster, the driver container, running the application master, is the first one to be launched by the cluster resource manager. The application master, after initializing its components, launches the primary driver thread in the same container. The driver thread runs the main method of the Spark application. The first thing the main method does is initialize the Spark context, which in turn hosts the key components of the driver responsible for driving and supervising the cluster execution of the underlying Spark application. …
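A minimal sketch of what such a main method looks like (the object and application names are illustrative assumptions):

import org.apache.spark.sql.SparkSession

object MyDriverApp {
  def main(args: Array[String]): Unit = {
    // Initializing the Spark session (which wraps the Spark context) is typically the
    // first thing a driver's main method does; this brings up the driver-side
    // components that drive and supervise the application's cluster execution.
    val spark = SparkSession.builder().appName("MyDriverApp").getOrCreate()

    // ... application logic: build Datasets/RDDs and trigger actions ...

    spark.stop()
  }
}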



GUIDE TO SPARK AGGREGATION

In the context of the recent official announcement of Spark 3.0, Aggregator now becomes the default mechanism to perform custom aggregation on Datasets, as Spark 3.0 addresses the key usability and coexistence concerns in the earlier Aggregator mechanism. Read the story to know the details.

The aggregation operator is heavily used across Spark applications meant for data mining and analytics. Therefore, Spark has provided both a wide variety of ready-made aggregation functions and a framework to build custom aggregation functions. These aggregation functions can be used on Datasets in a variety of ways to derive aggregated results.

With the Custom Aggregation framework, a user can implement a specific aggregation flow for aggregating a set of records. For custom aggregation, prior releases of Spark have provided two approaches: the first is based on ‘UserDefinedAggregateFunction’ and the second is based on ‘Aggregator’.

The two approaches are explained in detail in a previous story, titled “UDAF and Aggregators: Custom Aggregation Approaches for Datasets in Apache Spark”. Out of the two approaches, ‘UserDefinedAggregateFunction’, also called UDAF, was introduced first to support custom aggregation on Dataframes, the untyped view of data in Spark. However, later, with the introduction of the Dataset, which supports both typed and untyped views of data, the ‘Aggregator’ approach was additionally introduced to support custom aggregation on the typed view of data. …
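As a quick flavor of the typed ‘Aggregator’ approach, here is a minimal sketch. The employee case class and the average-salary logic are illustrative assumptions, not taken from the referenced story.

import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{Encoder, Encoders}

// Illustrative typed aggregation: average salary over a Dataset[Emp].
case class Emp(empId: Int, sal: Int, name: String, address: String, dept: Int)
case class AvgBuffer(sum: Long, count: Long)

object AvgSal extends Aggregator[Emp, AvgBuffer, Double] {
  def zero: AvgBuffer = AvgBuffer(0L, 0L)
  def reduce(b: AvgBuffer, e: Emp): AvgBuffer = AvgBuffer(b.sum + e.sal, b.count + 1)
  def merge(b1: AvgBuffer, b2: AvgBuffer): AvgBuffer = AvgBuffer(b1.sum + b2.sum, b1.count + b2.count)
  def finish(b: AvgBuffer): Double = if (b.count == 0L) 0.0 else b.sum.toDouble / b.count
  def bufferEncoder: Encoder[AvgBuffer] = Encoders.product[AvgBuffer]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Typed usage on a Dataset[Emp]:
// val avgSalary = empDs.select(AvgSal.toColumn).first()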



GUIDE TO SPARK PARTITIONING

Continuing an earlier story on determining the number of partitions at critical transformations, this story describes the reasoning behind the number of partitions created from the data file(s) in a Spark application.

The majority of Spark applications source the input data for their execution pipeline from a set of data files (in various formats). To facilitate the reading of data from files, Spark provides dedicated APIs in the context of both raw RDDs and Datasets. These APIs abstract the reading process from data files into an input RDD or a Dataset with a definite number of partitions. Users can then perform various transformations/actions on these input RDDs/Datasets.

Each of the partitions in an input raw RDD or Dataset is mapped to one or more data files; the mapping is done either on a part of a file or on the entire file. During the execution of a Spark Job with an input RDD/Dataset in its pipeline, each partition of the input RDD/Dataset is computed by reading the data as per the mapping of the partition to the data file(s). The computed partition data is then fed to dependent RDDs/Datasets further down the execution pipeline. …
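A minimal sketch of inspecting the partitions created from data files (the paths and the minimum-partition hint are assumptions for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("InputPartitions").getOrCreate()

// Dataset API: the number of input partitions is decided by Spark from the file
// sizes and the relevant configurations.
val input = spark.read.parquet("/path/to/data/files")
println(input.rdd.getNumPartitions)

// Raw RDD API: a minimum partition count can be hinted at read time.
val lines = spark.sparkContext.textFile("/path/to/text/files", minPartitions = 8)
println(lines.getNumPartitions)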



GUIDE TO SPARK PARTITIONING

The number of partitions plays a critical role in the execution of Spark applications. This story, in two parts, serves as a guide to reasoning about the number of partitions contained in an RDD or Dataset.

Data in Spark always remains partitioned: right after reading from a data source, during intermediate transformation(s), and until the point when an action is performed to produce the desired output. The partitioned data at each stage is represented by a low-level abstraction called RDD. Programmers can directly use RDDs to write Spark applications. However, a higher-level abstraction (built on top of RDD), called Dataset, is also available to users for writing Spark applications.

The Spark execution pipeline is also built around data partitions. A typical pipeline includes reading one or more partitions of an input RDD, computing intermediate partition(s) of intermediate RDD(s), and finally applying an action on the computed partition(s) of the desired RDD. …
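A small sketch of observing the partition counts along such a pipeline (the data and numbers are illustrative assumptions):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("PipelinePartitions").getOrCreate()

val src = spark.range(0, 1000000)                      // partitions set at generation/read time
val bucketed = src.withColumn("bucket", col("id") % 10)
val counts = bucketed.groupBy("bucket").count()        // the shuffle produces new partitions
                                                       // (governed by spark.sql.shuffle.partitions)
println(src.rdd.getNumPartitions)
println(counts.rdd.getNumPartitions)
counts.show()                                          // the action that triggers execution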



GUIDE TO APACHE SPARK EXECUTION

A stage in Spark represents a logical unit of parallel computation. Many such stages assembled together build the execution skeleton of a Spark application. This story tries to unravel the concept of a Spark stage and describes the important aspects related to it.

A Spark stage can be understood as a compute block that computes data partitions of a distributed collection, the compute block being able to execute in parallel on a cluster of computing nodes. Spark builds the parallel execution flow for a Spark application out of one or more such stages. Stages provide modularity, reliability, and resiliency to Spark application execution. Below are the various important aspects related to Spark stages:

Stages are created, executed, and monitored by the DAG scheduler: Every running Spark application has a DAG scheduler instance associated with it. This scheduler creates stages in response to the submission of a Job, where a Job essentially represents an RDD execution plan (also called the RDD DAG) corresponding to an action taken in a Spark application. Multiple Jobs can be submitted to the DAG scheduler if multiple actions are taken in a Spark application. For each Job submitted to it, the DAG scheduler creates one or more stages, builds a stage DAG to list out the stage dependency graph, and then plans the execution schedule for the created stages in accordance with the stage DAG. The scheduler also monitors the status of stage execution completion, which could turn out to be success, partial success, or failure. …
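To make the Job-to-action relationship concrete, here is a minimal sketch (the sample data and logic are illustrative assumptions): each action below submits a separate Job, and the DAG scheduler breaks each Job into stages at its shuffle boundaries.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("JobsAndStages").getOrCreate()
import spark.implicits._

val df = Seq((1, 60000, 10), (2, 45000, 20), (3, 70000, 10)).toDF("empId", "sal", "dept")

// Action 1: count() submits a Job; the groupBy introduces a shuffle boundary,
// so this Job spans multiple stages.
val perDept = df.groupBy("dept").count()
println(perDept.count())

// Action 2: collecting the filtered rows submits another, independent Job.
println(df.filter(col("sal") > 50000).collect().length)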

About

Ajay Gupta

Big Data Architect, Apache Spark Specialist, https://www.linkedin.com/in/ajaywlan/
