
Stragglers are detrimental to the overall performance of Spark applications and lead to resource wastage on the underlying cluster. Therefore, it is important to identify potential stragglers in your Spark Job, determine the root cause behind them, and apply the required fixes or preventive measures.

What is a Straggler in…


Hands-on Tutorials, GUIDE TO SPARK EXECUTION

Let us assume there is a very large Dataset ‘A’ with the following schema:

root
 |-- empId: integer
 |-- sal: integer
 |-- name: string
 |-- address: string
 |-- dept: integer

The Dataset ‘A’ needs to be filtered against a set of employee IDs (empIds), ‘B’…
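As a rough illustration of this scenario, here is a minimal spark-shell sketch that filters a large Dataset against a small set of IDs using a broadcasted left-semi join. The parquet paths are hypothetical stand-ins for 'A' and 'B', since the article does not specify a data source:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("FilterByIds").master("local[*]").getOrCreate()

// Hypothetical sources standing in for the article's 'A' and 'B'.
val a = spark.read.parquet("/data/employees")   // large Dataset 'A'
val b = spark.read.parquet("/data/empIds")      // small set of empIds 'B'

// A left-semi join keeps only the rows of 'A' whose empId appears in 'B';
// broadcasting 'B' avoids shuffling the large side when 'B' fits in memory.
val filtered = a.join(broadcast(b), Seq("empId"), "left_semi")
filtered.show()
```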


GUIDE TO SPARK EXECUTION

Join operations are often used in a typical data analytics flow to correlate two data sets. Apache Spark, being a unified analytics engine, provides a solid foundation for executing a wide variety of Join scenarios.

At a very high level, Join operates on two input data…
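To make the idea concrete, below is a minimal spark-shell sketch of a basic equi-join; the two tiny datasets and their columns are invented for illustration and do not come from the truncated article:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("JoinExample").master("local[*]").getOrCreate()
import spark.implicits._

// Two tiny illustrative inputs.
val employees   = Seq((1, "Alice", 10), (2, "Bob", 20)).toDF("empId", "name", "dept")
val departments = Seq((10, "Sales"), (20, "Engineering")).toDF("dept", "deptName")

// Inner equi-join on the common 'dept' column; passing "left_outer",
// "right_outer", "full_outer", "left_semi", or "left_anti" instead of
// "inner" selects the other supported join types.
employees.join(departments, Seq("dept"), "inner").show()
```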

I encountered Apache Spark around four years ago, and since then I have been architecting Spark applications meant for executing complex data processing flows on multiple massive data sets.

During all these years of architecting numerous Spark Jobs and working with a big team of Spark developers…

GUIDE TO SPARK EXECUTION

Shuffle operations are the backbone of almost all Spark Jobs that are aimed at data aggregation, joins, or data restructuring. During a shuffle operation, the data is shuffled across various nodes of the cluster via a two-step process:

a) Shuffle Write: Shuffle map tasks write the data to be shuffled…
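The following spark-shell sketch triggers a shuffle via a groupBy aggregation, one of the common shuffle-producing operations; the sample data and the partition count of 8 are arbitrary choices for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ShuffleExample").master("local[*]").getOrCreate()
import spark.implicits._

// Number of reduce-side partitions a Spark SQL shuffle produces (default 200).
spark.conf.set("spark.sql.shuffle.partitions", "8")

val sales = Seq(("east", 100), ("west", 200), ("east", 50)).toDF("region", "amount")

// groupBy triggers a shuffle: map tasks first write the shuffled data locally
// (shuffle write), then reduce tasks fetch the blocks belonging to their
// partition over the network (shuffle read).
sales.groupBy("region").sum("amount").show()
```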


GUIDE TO SPARK EXECUTION

While running a Spark application on a cluster, the driver container, which runs the application master, is the first one launched by the cluster resource manager. The application master, after initializing its components, launches the primary driver thread in the same container. The driver thread runs the main method of…
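A skeletal example of such an application follows; the object name MyApp and the jar name are hypothetical, and the spark-submit line in the comment is just one typical way of launching it in cluster deploy mode:

```scala
import org.apache.spark.sql.SparkSession

// In cluster deploy mode, the resource manager launches the driver container
// first; the application master inside it then starts the driver thread,
// which executes this main method.
object MyApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("MyApp").getOrCreate()
    // ... transformations and actions ...
    spark.stop()
  }
}

// Launched, for example, with:
//   spark-submit --master yarn --deploy-mode cluster --class MyApp my-app.jar
```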


GUIDE TO SPARK AGGREGATION

The aggregation operator is heavily used across Spark applications meant for data mining and analytics. Therefore, Spark provides both a wide variety of ready-made aggregation functions and a framework to build custom aggregation functions. …
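The spark-shell sketch below illustrates both flavors: ready-made functions such as sum and avg, and a minimal custom typed aggregation built on Spark's Aggregator framework; the sample data and the SumSal name are invented for illustration:

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.functions.{avg, sum}
import org.apache.spark.sql.expressions.Aggregator

val spark = SparkSession.builder().appName("AggExample").master("local[*]").getOrCreate()
import spark.implicits._

val ds = Seq(("sales", 100L), ("hr", 80L), ("sales", 120L)).toDS()

// Ready-made aggregation functions:
ds.toDF("dept", "sal").groupBy("dept").agg(sum("sal"), avg("sal")).show()

// A custom typed aggregation built on the Aggregator framework:
object SumSal extends Aggregator[(String, Long), Long, Long] {
  def zero: Long = 0L
  def reduce(acc: Long, row: (String, Long)): Long = acc + row._2
  def merge(a: Long, b: Long): Long = a + b
  def finish(acc: Long): Long = acc
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}
ds.groupByKey(_._1).agg(SumSal.toColumn.name("total_sal")).show()
```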


GUIDE TO SPARK PARTITIONING

The majority of Spark applications source the input data for their execution pipeline from a set of data files (in various formats). To facilitate reading data from files, Spark provides dedicated APIs in the context of both raw RDDs and Datasets. These APIs abstract the reading process from…
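For instance, the two dedicated entry points look roughly like this in spark-shell; the file paths and the CSV header option are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ReadExample").master("local[*]").getOrCreate()

// Dataset API: a format-aware reader (csv, json, parquet, orc, ...).
val df = spark.read.option("header", "true").csv("/path/to/data.csv")   // hypothetical path

// Raw RDD API: each element is one line of the text file.
val rdd = spark.sparkContext.textFile("/path/to/data.txt")              // hypothetical path

// Both APIs decide how many partitions the read produces:
println(df.rdd.getNumPartitions)
println(rdd.getNumPartitions)
```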


GUIDE TO SPARK PARTITIONING

Data in Spark always remains partitioned: right after being read from a data source, during intermediate transformation(s), and up to the point when an action is performed to produce the desired output. The partitioned data at each stage is represented by a low-level abstraction called an RDD. Programmers can directly use RDDs to…
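A small spark-shell sketch of this behavior, with arbitrary partition counts chosen for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("PartitionExample").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val rdd = sc.parallelize(1 to 1000, numSlices = 8)
println(rdd.getNumPartitions)      // 8: partitioned right at the source

val mapped = rdd.map(_ * 2)        // a narrow transformation preserves partitioning
println(mapped.getNumPartitions)   // still 8

val fewer = mapped.repartition(4)  // explicit repartitioning (incurs a shuffle)
println(fewer.getNumPartitions)    // 4
```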


GUIDE TO APACHE SPARK EXECUTION

A Spark stage can be understood as a compute block that computes the data partitions of a distributed collection and can execute in parallel across a cluster of computing nodes. Spark builds the parallel execution flow of a Spark application from one or more stages. Stages provide modularity…
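Here is a minimal spark-shell sketch that exposes stage boundaries via the RDD lineage; the input path is hypothetical, and reduceByKey is used as a representative shuffle that splits the job into two stages:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("StageExample").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// reduceByKey shuffles, so everything before it forms one stage and the
// aggregation after it forms the next.
val counts = sc.textFile("/path/to/data.txt")   // hypothetical path
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// The lineage printout marks the shuffle (stage) boundary via the
// ShuffledRDD entry and its indentation.
println(counts.toDebugString)
```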

Ajay Gupta

Big Data Architect, Apache Spark Specialist, https://www.linkedin.com/in/ajaywlan/
