Stragglers are detrimental to the overall performance of Spark applications and lead to resource wastages on the underlying cluster. Therefore, it is important to identify potential stragglers in your Spark Job, identify the root cause behind them, and put required fixes or provide preventive measures.
What is a Straggler in a Spark Application ?:
A straggler refers to a very very slow executing Task belonging to a particular stage of a Spark application (Every stage in Spark is composed of one or more Tasks, each one computing a single partition out of the total partitions designated for the stage). A…
Let us assume there is a very large Dataset ‘A’ with the following schema:
| — empId: Integer
| — sal: Integer
| — name: String
| — address: String
| — dept: Integer
The Dataset ‘A’ needs to be filtered against a set of employee IDs (empIds), ‘B’ (can be broadcasted to executors), to get a filtered Dataset ‘A`’. The filter operation can be represented as:
A` = A.filter(A.empId contains in 'B')
To achieve this most common filtering scenario, you can use four types of transformation in Spark, each one having its own pros and cons. Here is…
Join operations are often used in a typical data analytics flow in order to correlate two data sets. Apache Spark, being a unified analytics engine, has also provided a solid foundation to execute a wide variety of Join scenarios.
At a very high level, Join operates on two input data sets and the operation works by matching each of the data records belonging to one of the input data sets with every other data record belonging to another input data set. On finding a match or a non-match (as per a given condition), the Join operation could either output an…
I encountered Apache Spark around 4 years back, and since then, I have been architecting Spark applications that are meant for executing complex data processing flow on massively sized multiple data sets.
During all these years of architecting numerous Spark Jobs and working with a big team of Spark developers, I noticed, in general, that the comprehensive understanding of various aspects of Spark partitioning lacks among Spark users. Because of this, they lose on the massive optimization opportunities which exist for building reliable, efficient, and scalable Spark Jobs meant for processing larger data sets.
Therefore, based on our experience, knowledge…
Shuffle operations are the backbone of almost all Spark Jobs that are aimed at data aggregation, joins, or data restructuring. During a shuffle operation, the data is shuffled across various nodes of the cluster via a two-step process:
a) Shuffle Write: Shuffle map tasks write the data to be shuffled in a disk file, the data is arranged in the file according to shuffle reduce tasks. Bunch of shuffle data corresponding to a shuffle reduce task written by a shuffle map task is called a shuffle block. …
While running a Spark application on a cluster, the driver container, running the application master, is the first one to be launched by the cluster resource manager. Application master, after initializing its components, launches the primary driver thread, in the same container. The driver thread runs the main’s method of the Spark application. The first thing the main method does is the initialization of the Spark context which in turn hosts the key components of the driver responsible for driving & supervising the cluster execution of the underlying Spark application. …
Aggregation operator is heavily used across Spark applications meant for data mining and analytics. Therefore, Spark has provided both, a wide variety of readymade aggregation functions and a framework to built custom aggregation functions. These aggregations functions can be used on Datasets in a variety of ways to derive aggregated results.
With the Custom Aggregation framework, a user can implement a specific aggregation flow for aggregating a set of records. For Custom Aggregation, prior releases of Spark have provided two approaches, first is based on ‘UserDefinedAggregationFunction’ and the second is based on ‘Aggregator’.
The majority of Spark applications source input data for their execution pipeline from a set of data files (in various formats). To facilitate the reading of data from files, Spark has provided dedicated APIs in the context of both, raw RDDs and Datasets. These APIs abstract the reading process from data files to an input RDD or a Dataset with a definite number of partitions. Users can then perform various transformations/actions on these inputs RDDs/Datasets.
Data in Spark remain always partitioned right after reading from a data source, during intermediate transformation(s), and till the point when an action is performed to produce the desired output. The partitioned data at each stage is represented by a low-level abstraction, called RDD. Programmers can directly use RDDs to write Spark applications. However, optionally, higher-level abstraction (built on top of RDD), called as Dataset, is also available to users to write Spark applications.
Spark execution pipeline is also built around data partitions only. A typical pipeline includes reading of one or more partitions of an input RDD, computing intermediate…
A Spark stage can be understood as a compute block to compute data partitions of a distributed collection, the compute block being able to execute in parallel in a cluster of computing nodes. Spark builds parallel execution flow for a Spark application using single or multiple stages. Stages provides modularity, reliability and resiliency to spark application execution. Below are the various important aspects related to Spark Stages:
Stages are created, executed and monitored by DAG scheduler: Every running Spark application has a DAG scheduler instance associated with it. This scheduler create stages in response to submission of a Job, where…