https://pixabay.com/images

Apache Spark: MapPartitions — A Powerful Narrow Data Transformation

MapPartitions is a powerful transformation available in Spark which programmers would definitely like. It gives them the flexibility to process partitions as a whole by writing custom logic on lines of single-threaded programming. This story today highlights the key benefits of MapPartitions

Apache Spark, on a high level, provides two types of data transformation for use in data analytics programs, the Narrow ones, and the Wide ones. Narrow ones compute a data partition of a Spark Dataset/RDD only from a single partition of a parent Spark Dataset/RDD, while the wide ones compute a data partition of a Spark Dataset/RDD from multiple partitions of a parent Spark Dataset/RDD.

  1. Combined Processing Opportunity: mapPartitions also provide the opportunity to combine map/flatMap operation with a filter operation. This combination would yield in higher efficiency of Spark Job since the overhead of setting up and managing multiple data transformation steps could be avoided.
  2. Efficient Local Aggregation: Since mapPartitions works on the partition level, it gives the opportunity to the user to do aggregation at a partition level. This local aggregation becomes very important in aggregation (reduce) operations of large data sets because it greatly reduces the amount of shuffled data (to be managed). However, the amount of reduction varies from dataset to dataset. In data sets where each data partitions carry multiple entries against the aggregation key(s), there is a significant reduction in the amount of shuffled data due to local aggregation. It is evident that the reduction in shuffle data results in an improvement in efficiency and reliability of reducing operations.
  3. Global Aggregation Opportunity: Similar to the local aggregation opportunity, mapPartitions could provide the user to do an efficient global aggregation by completely avoiding the data shuffling operation. However, this opportunity is available to the user only in cases where the data is partitioned in such a way that all the data records (to be aggregated) against an aggregation key resides in a single partition.
  4. Avoidance of Repetitive Heavy Initialization: mapPartitions also comes to rescue in cases where a similar heavy initialization is must before the processing of each data record residing in a data partition. A usual narrow transformation like map and flatMap in such cases becomes very inefficient since they incur a major overhead of repetitive initialization/de-initialization. However, in the case of mapPartitions usage, this heavy initialization would be executed only once and would suffice for all the data records residing in a partition. An example of a heavy initialization could be the initialization of a DB connection to update/insert a record.
  5. Avoidance of Explicit Filtering Step: Since mapPartitions (in comparison to usual map and flatMap transformation) could return a data collection with a size different than the size of the input data collection (residing on a partition), users can benefit from the avoidance of a subsequent explicit filtering step required to remove the unwanted records/null values that get inserted in the collection in the usual map/flatMap transformation.
  6. Stateful Partition Wise Processing: mapPartitions also provides the power to process a data partition with respect to state/correlation that is local to that partition only. In some cases, we require to process a partition as a whole because processing is dependent on some correlation or state shared between various data records of that partition. In such rare cases, only mapPartitions can rescue the user.

Big Data Architect, Apache Spark Specialist, https://www.linkedin.com/in/ajaywlan/