Reputation: 805
I was wondering how Apache Spark implements the shuffle phase. Does it use the same technique as MapReduce? For example:
rddB = rddA.map1.groupByKey();
rddX = rddB.map2.map3.saveAsTextFile();
rddY = rddB.map4.map5.saveAsTextFile();
Does it perform map1, then partition by key and save the intermediate data to disk (or memory)?
Does it then read the intermediate files twice, once for the map2/map3 branch and a second time for the map4/map5 branch, without recomputing rddB even though we did not explicitly cache rddB?
Upvotes: 0
Views: 744
Reputation: 3055
No, Spark behaves in a slightly different way. First of all, Spark doesn't actually perform the operation when the line you have written is encountered; instead, it builds a DAG of the operations needed to obtain a given RDD or result. Spark's operations are split into two main categories, transformations and actions, and it executes the DAG only when an action is encountered.
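A minimal sketch of this laziness, assuming a local SparkContext named sc and a made-up keying function (the input path and key extraction are placeholders, not from the question):

```scala
// Transformations only extend the DAG; nothing is computed here.
val rddB = sc.textFile("input.txt")                 // lazy
  .map(line => (line.split(",")(0), line))          // lazy: adds a node to the DAG
  .groupByKey()                                     // lazy, but marks a shuffle boundary

// Only an action triggers execution of the whole lineage:
rddB.saveAsTextFile("out")                          // now Spark schedules stages and runs them
```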
Moreover, Spark stores intermediate results only when you tell it to do so, i.e. when you invoke persist or cache on an RDD. If you don't, it will perform all the operations needed to obtain a given result, starting again from the root of the DAG (i.e. reading the data from the files).
The previous statement is not entirely true, though. In fact, the manual says here:
Spark also automatically persists some intermediate data in shuffle operations (e.g. reduceByKey), even without users calling persist. This is done to avoid recomputing the entire input if a node fails during the shuffle. We still recommend users call persist on the resulting RDD if they plan to reuse it.
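Because the map-side output of a shuffle is written to local disk on the executors, the second job in the question's example can often skip re-running the stage before the shuffle, as long as those shuffle files are still available. A sketch of what this looks like (in the Spark UI the reused stage is reported as "skipped"):

```scala
val rddB = rddA.map(f1).groupByKey() // shuffle: map-side output written to executor disk

rddB.map(f2).saveAsTextFile("outX")  // job 1: runs the map stage and the shuffle
rddB.map(f4).saveAsTextFile("outY")  // job 2: the pre-shuffle stage can be skipped,
                                     // reusing the existing shuffle files
```

This is a fault-tolerance and scheduling optimization, not a substitute for caching: calling persist on rddB is still the reliable way to avoid recomputation when you reuse it.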
Upvotes: 0