A Tale of Two Transformations: Map & MapPartitions in Apache Spark
Spark provides two important transformations, Map and MapPartitions, that let developers implement custom data processing logic. However, because their capabilities overlap, developers are often unsure which of the two to choose.
Map and MapPartitions both fall in the category of narrow transformations: there is a one-to-one mapping between input and output partitions when either is executed. Both also allow developers to run custom processing logic on typed records. From a functionality perspective, however, they differ:
Map, when applied to a Spark Dataset of a certain type, processes one record at a time within each input partition of the Dataset. In other words, the processing function supplied to Map is invoked once for every record, and for each input record it returns either a NULL, a record of a new type, or a modified record of the same type as the input.
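As a minimal sketch of Map on a typed Dataset (the case classes and the local SparkSession here are illustrative assumptions, not from the original):

```scala
import org.apache.spark.sql.SparkSession

case class Order(id: Long, amount: Double)
case class OrderWithTax(id: Long, total: Double)

object MapExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("map-example").getOrCreate()
    import spark.implicits._

    val orders = Seq(Order(1L, 100.0), Order(2L, 250.0)).toDS()

    // Map invokes the function once per record; here each input record
    // yields one output record of a new type.
    val withTax = orders.map(o => OrderWithTax(o.id, o.amount * 1.18))
    withTax.show()

    spark.stop()
  }
}
```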
On the contrary, MapPartitions processes all the records belonging to a single input partition together. To accomplish this, the input function for MapPartitions takes an Iterator over the records of one partition and returns an Iterator over a new collection of records (an empty Iterator if the partition produces no output), the collection typically being built inside the function after processing the input records of the partition. Further, the record type of the returned collection can be the same as, or different from, the input record type, and the number of records in the output collection can differ from the total number of input records.
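A minimal sketch of MapPartitions, where the function sees a whole partition at once and emits fewer records than it receives (the case classes, the SparkSession, and the per-partition aggregation shown are illustrative assumptions; note the averages are per partition, not global):

```scala
import org.apache.spark.sql.SparkSession

case class Reading(sensorId: Int, value: Double)
case class SensorAvg(sensorId: Int, avg: Double)

object MapPartitionsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("mappartitions-example").getOrCreate()
    import spark.implicits._

    val readings = Seq(Reading(1, 10.0), Reading(1, 20.0), Reading(2, 5.0)).toDS()

    // The function receives an Iterator over ALL records of one partition
    // and returns an Iterator over a new, smaller collection built inside it.
    val perPartitionAverages = readings.mapPartitions { iter =>
      iter.toSeq
        .groupBy(_.sensorId)
        .map { case (id, rs) => SensorAvg(id, rs.map(_.value).sum / rs.size) }
        .iterator
    }
    perPartitionAverages.show()

    spark.stop()
  }
}
```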
Given these semantics, MapPartitions is clearly broader and more generic than Map: it covers both the scenario where the processing logic runs independently on each record to produce an output value, and the scenario where the logic must take multiple related records into account before producing an output value.
One can therefore always use MapPartitions to perform the per-record processing otherwise achieved with Map, but let us see whether the former is as performant as the latter. To get some insight, let us compare the DAGs of a job that scans records from an input file, applies an identity transformation (output record == input record) first via Map in one run and then via MapPartitions in a second run, and finally writes the output to an output file.
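The two runs can be sketched as follows (a minimal sketch: the SparkSession `spark` and the file paths are assumed placeholders):

```scala
import spark.implicits._

// Run 1: identity transformation via Map (one function call per record)
spark.read.textFile("input.txt")
  .map(line => line)
  .write.text("output-map")

// Run 2: identity transformation via MapPartitions (one call per partition,
// passing the partition's Iterator straight through)
spark.read.textFile("input.txt")
  .mapPartitions(iter => iter)
  .write.text("output-mappartitions")
```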
Here is the DAG when Map is used:
Here is the DAG when MapPartitions is used:
From the DAGs, one can easily see that Map is more performant than MapPartitions for executing per-record processing logic: the Map DAG consists of a single WholeStageCodegen step, whereas the MapPartitions DAG comprises 4 steps linked via the Volcano iterator execution model, which performs significantly worse than a single WholeStageCodegen step.
Another point to remember is that MapPartitions requires a higher proportion of user memory in the executor heap than Map for per-record processing logic, since it typically builds an intermediate collection whose size is proportional to the size of the corresponding input partition before producing the desired output.
Therefore, for per-record processing logic, always prefer Map over MapPartitions. There can, however, be a small benefit to using MapPartitions instead of Map in cases where Map is subsequently coupled with a filter transformation to discard certain records/NULLs produced by the Map. This is because the Map and Filter functionalities can be performed together inside a single MapPartitions.
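To illustrate, the fused variant can be written as below (a sketch: `lines` is an assumed `Dataset[String]`, and `parse` is a hypothetical function returning null for malformed records). Keeping the Iterator lazy, rather than materialising a collection, also avoids the extra user-memory cost discussed above:

```scala
// Two chained transformations: Map followed by Filter
val parsedTwoSteps = lines
  .map(parse(_))          // hypothetical parse; may yield NULLs for bad records
  .filter(_ != null)

// The same work fused into one MapPartitions: the returned Iterator is lazy,
// so no intermediate collection is built for the partition.
val parsedOneStep = lines.mapPartitions { iter =>
  iter.map(parse).filter(_ != null)
}
```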
Another thing to note is that both Map and MapPartitions require a deserialization step before, and a serialization step after, the processing logic, and therefore incur a performance penalty compared to functionally similar built-in Spark functions, because the latter operate directly on the Tungsten format.
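For example, uppercasing a `Dataset[String]` named `names` can be done both ways (a sketch; `value` is the default column name Spark assigns to a `Dataset[String]`):

```scala
import org.apache.spark.sql.functions.upper
import spark.implicits._

// Typed Map: deserialize each row to a String, run the lambda,
// then serialize the result back into Tungsten rows.
val upperTyped = names.map(_.toUpperCase)

// Built-in function: operates directly on the Tungsten binary format,
// with no per-record (de)serialization round trip.
val upperBuiltin = names.select(upper($"value").as[String])
```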
Considering these facts, developers should choose between Map and MapPartitions judiciously, and should prefer them only when the desired data processing logic cannot be accomplished with any of Spark's built-in functions. In particular, do not treat MapPartitions as a magic pill for everything.
In case of feedback or queries on this story, do write in the comments section. Here is the link to my other comprehensive stories on Apache Spark. Also, here is the link to my exclusive book on Spark Partitioning: “Guide to Spark Partitioning: Spark Partitioning Explained in Depth”