
Stragglers in a Spark application degrade overall application performance and waste premium resources.



Hands-on Tutorials, GUIDE TO SPARK EXECUTION

Filtering a Spark Dataset against a collection of data values is commonly encountered in many data analytics flows. This story explains four different ways to achieve it.

root
 |-- empId: Integer
 |-- sal: Integer
 |-- name: String
 |-- address: String
 |-- dept: Integer
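As an illustrative sketch (not code from the story itself), one common way to filter a Dataset against a collection of values is the `isin` column method; the `Employee` case class and sample rows below are assumptions that mirror the schema above:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class mirroring the schema shown above.
case class Employee(empId: Int, sal: Int, name: String, address: String, dept: Int)

object IsinFilterExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("IsinFilter").master("local[*]").getOrCreate()
    import spark.implicits._

    val employees = Seq(
      Employee(1, 1000, "A", "addr1", 10),
      Employee(2, 2000, "B", "addr2", 20),
      Employee(3, 3000, "C", "addr3", 30)
    ).toDS()

    // Keep only rows whose dept value appears in the given collection.
    val targetDepts = Seq(10, 30)
    val filtered = employees.filter($"dept".isin(targetDepts: _*))

    filtered.show()
    spark.stop()
  }
}
```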



GUIDE TO SPARK EXECUTION

This story is dedicated exclusively to the Join operation in Apache Spark, giving you an overall perspective of the foundation on which Spark's Join technology is built.


Partitioning is one of the basic building blocks on which the Apache Spark framework is built. Just by setting the right partitioning across the various stages, many Spark programs can be optimized right away.
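To make that concrete, here is a hedged sketch of explicitly controlling partitioning (the column names and partition counts are arbitrary assumptions, not from the story):

```scala
import org.apache.spark.sql.SparkSession

object RepartitionExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Repartition").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = (1 to 1000).map(i => (i % 50, i)).toDF("key", "value")

    // Hash-partition on "key" so a subsequent aggregation or join
    // on that column need not trigger an extra shuffle of its own.
    val byKey = df.repartition(16, $"key")

    // coalesce narrows the partition count without a full shuffle,
    // e.g. to avoid producing too many small output files.
    val compact = byKey.coalesce(4)

    println(compact.rdd.getNumPartitions) // 4
    spark.stop()
  }
}
```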


GUIDE TO SPARK EXECUTION

Most Spark developers spend considerable time troubleshooting the Fetch Failed Exceptions observed during shuffle operations. This story covers the most common causes of a Fetch Failed Exception and reveals the results of a recent poll conducted on the exception.



GUIDE TO SPARK EXECUTION

The Spark Driver hosted for a Spark application is solely responsible for driving and supervising the parallel execution of the latter on a cluster of computing resources. This story focuses on the key components that enable the Spark Driver to perform its duties.



GUIDE TO SPARK AGGREGATION

In the context of the recent official announcement of Spark 3.0, Aggregator now becomes the default mechanism to perform custom aggregation on Datasets, as Spark 3.0 addresses the key usability and coexistence concerns of the earlier Aggregator mechanism. Read the story to know the details.
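For context, a minimal sketch of the Spark 3.x `Aggregator` API is shown below; the `AvgSalary` name and the (sum, count) buffer are illustrative assumptions, not material from the story:

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Illustrative typed Aggregator computing an average salary.
// The buffer is a plain (sum, count) tuple.
object AvgSalary extends Aggregator[Int, (Long, Long), Double] {
  def zero: (Long, Long) = (0L, 0L)
  def reduce(buf: (Long, Long), sal: Int): (Long, Long) =
    (buf._1 + sal, buf._2 + 1)
  def merge(b1: (Long, Long), b2: (Long, Long)): (Long, Long) =
    (b1._1 + b2._1, b1._2 + b2._2)
  def finish(buf: (Long, Long)): Double =
    if (buf._2 == 0) 0.0 else buf._1.toDouble / buf._2
  def bufferEncoder: Encoder[(Long, Long)] =
    Encoders.tuple(Encoders.scalaLong, Encoders.scalaLong)
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Typed usage on a Dataset[Int] of salaries:
//   ds.select(AvgSalary.toColumn)
```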



GUIDE TO SPARK PARTITIONING

Continuing an earlier story on determining the number of partitions at critical transformations, this story describes the reasoning behind the number of partitions created from the data file(s) in a Spark application.



GUIDE TO SPARK PARTITIONING

The number of partitions plays a critical role in the execution of Spark applications. This two-part story serves as a guide to reasoning about the number of partitions contained in an RDD or Dataset.
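As a quick illustration of the knobs involved (a sketch, not material from the story; the counts are assumptions):

```scala
import org.apache.spark.sql.SparkSession

object PartitionCountExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PartitionCount")
      .master("local[4]")
      // Shuffle-produced Datasets take this many partitions (200 if unset).
      .config("spark.sql.shuffle.partitions", "8")
      .getOrCreate()
    import spark.implicits._

    // Without an explicit count, parallelize uses
    // spark.default.parallelism (here, the 4 local cores).
    val rdd = spark.sparkContext.parallelize(1 to 100)
    println(s"RDD partitions: ${rdd.getNumPartitions}")

    // A wide (shuffle) transformation yields
    // spark.sql.shuffle.partitions partitions.
    val grouped = rdd.toDF("n").groupBy($"n" % 10).count()
    println(s"Post-shuffle partitions: ${grouped.rdd.getNumPartitions}")

    spark.stop()
  }
}
```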



GUIDE TO APACHE SPARK EXECUTION

A stage in Spark represents a logical unit of parallel computation. Many such stages assembled together build the execution skeleton of a Spark application. This story unravels the concept of a Spark stage and describes important related aspects.

Ajay Gupta

Big Data Architect, Apache Spark Specialist, https://www.linkedin.com/in/ajaywlan/
