Guide to Spark Partitioning

I encountered Apache Spark about four years ago, and since then I have been architecting Spark applications meant for executing complex data processing flows on massive, multiple data sets.

During all these years of architecting numerous Spark jobs and working with a large team of Spark developers, I noticed that Spark users generally lack a comprehensive understanding of the various aspects of Spark partitioning. Because of this, they miss out on the massive optimization opportunities that exist for building reliable, efficient, and scalable Spark jobs for processing large data sets.

Therefore, based on our experience, knowledge, and research, my colleague Naushad and I decided to write a book focusing on just this one important aspect of Apache Spark: partitioning. The book’s title, “Guide to Spark Partitioning”, reflects this single objective.

Chapter 1 of the book introduces the concept of partitioning and its importance. Chapter 2 explains in depth the partitioning rules applied while reading ingested data files. Chapter 3 explains in depth the partitioning behavior of Spark transformations that affect the partitioning structure. Chapter 4 focuses on the explicit partitioning APIs, including the various repartition APIs and the coalesce API. Chapter 5, the last chapter, details how partitions are written out to permanent storage.
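As a small taste of the mechanics the book covers, here is a minimal, illustrative Python sketch (not Spark code itself) of how Spark’s default HashPartitioner decides which partition a key lands in: a non-negative modulo of the key’s hash over the number of partitions. The helper names below are my own; Python’s built-in `hash()` stands in for Java’s `hashCode`.

```python
def non_negative_mod(x: int, mod: int) -> int:
    # Mirrors the intent of Spark's Utils.nonNegativeMod: in Java,
    # hashCode % n can be negative, so negative results are folded
    # back into the range [0, n).
    r = x % mod
    return r + mod if r < 0 else r

def hash_partition(key, num_partitions: int) -> int:
    # HashPartitioner-style assignment: partition index derived from
    # the key's hash, guaranteed to fall in [0, num_partitions).
    return non_negative_mod(hash(key), num_partitions)

# Example: assign a handful of keys across 4 partitions.
keys = ["user-1", "user-2", "user-3", "user-4"]
assignment = {k: hash_partition(k, 4) for k in keys}
```

Keys with equal hashes always land in the same partition, which is exactly why skewed key distributions translate into skewed partitions — one of the pitfalls the explicit repartition and coalesce APIs in Chapter 4 help you address.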

Further, the book focuses primarily on the RDD and Dataset representations of the data, since the earlier Dataframe representation has been merged into the Dataset API in recent versions of Spark. To aid understanding of the concepts presented, we have also provided many examples in every chapter of the book.

The book is available on Kindle; here are the links to get your copy:

We hope the book will concretize your understanding of the various aspects of Spark partitioning. Armed with the knowledge gained from it, you should be able to set up the right partitioning in your Spark jobs for large data sets.




Ajay Gupta
Leading Data Engineering Initiatives @ Jio, Apache Spark Specialist, Author, LinkedIn:
