Giorgio

Reputation: 1093

Understanding shuffle managers in Spark

I would like to clarify in depth how shuffle works and how Spark uses shuffle managers. Here are some very helpful resources:

https://trongkhoanguyenblog.wordpress.com/

https://0x0fff.com/spark-architecture-shuffle/

https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/4-shuffleDetails.md

Reading them, I understood that there are different shuffle managers. I want to focus on two of them: the hash manager and the sort manager (which is the default manager).

To frame my question, I want to start from a very common transformation:

val rdd = pairRDD.reduceByKey(_ + _)  // where pairRDD is some RDD[(K, V)]

This transformation causes map-side aggregation and then a shuffle that brings all values with the same key into the same partition.
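
For example, a minimal self-contained version of this (sc is an existing SparkContext, e.g. in spark-shell; the sample data and the name pairRDD are just placeholders):

import org.apache.spark.rdd.RDD

// build a small pair RDD; the data is made up
val pairRDD: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// combine values per key locally in each map task, then shuffle the partial results
val rdd: RDD[(String, Int)] = pairRDD.reduceByKey(_ + _)

rdd.collect()  // Array((a,4), (b,2)) -- order may vary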

My questions are:

Upvotes: 15

Views: 3201

Answers (1)

user7337271

Reputation: 1712

What follows is a step-by-step description of reduceByKey:

  1. reduceByKey calls combineByKeyWithClassTag, with an identity createCombiner and the same function used for both mergeValue and mergeCombiners (see the sketch after this list)
  2. combineByKeyWithClassTag creates an Aggregator and returns a ShuffledRDD. Both "map"-side and "reduce"-side aggregations use this internal mechanism and don't utilize mapPartitions.
  3. Aggregator uses ExternalAppendOnlyMap for both combineValuesByKey ("map side reduction") and combineCombinersByKey ("reduce side reduction")
  4. Both methods use the ExternalAppendOnlyMap.insertAll method
  5. ExternalAppendOnlyMap keeps track of spilled parts and the current in-memory map (SizeTrackingAppendOnlyMap)
  6. The insertAll method updates the in-memory map and checks on each insert whether the estimated size of the current map exceeds the threshold. It uses the inherited Spillable.maybeSpill method. If the threshold is exceeded, this method calls spill as a side effect, and insertAll initializes a clean SizeTrackingAppendOnlyMap
  7. spill calls spillMemoryIteratorToDisk, which gets a DiskBlockObjectWriter object from the block manager.
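
A small runnable sketch of what steps 1 and 2 mean at the API level (sc and the sample data are assumptions added for illustration): the public combineByKey, which delegates to combineByKeyWithClassTag internally, is called with an identity createCombiner and the same merge function on both sides, which is what reduceByKey passes down.

import org.apache.spark.rdd.RDD

// sc is an existing SparkContext; the sample data is made up for illustration
val pairs: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

val viaReduce = pairs.reduceByKey(_ + _)

// Equivalent (up to internal details) spelled out via combineByKey:
val viaCombine = pairs.combineByKey[Int](
  (v: Int) => v,                  // createCombiner: identity
  (c: Int, v: Int) => c + v,      // mergeValue ("map side reduction")
  (c1: Int, c2: Int) => c1 + c2   // mergeCombiners ("reduce side reduction")
)

viaReduce.collect().sorted   // Array((a,4), (b,2))
viaCombine.collect().sorted  // same result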

The insertAll steps are applied for both map-side and reduce-side aggregations, with the corresponding Aggregator functions, and the shuffle stage in between.
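
The spill mechanism from steps 5-7 can be modelled with a standalone toy class. This is not Spark's ExternalAppendOnlyMap/Spillable code, just a sketch of the shape of the mechanism, assuming for simplicity that "size" is the number of entries and that "spilling" means moving the current map into an in-memory buffer instead of a disk file:

import scala.collection.mutable

class ToyAppendOnlyMap[K, V](mergeValue: (V, V) => V, spillThreshold: Int) {
  private var currentMap = mutable.Map.empty[K, V]
  private val spills = mutable.ArrayBuffer.empty[Map[K, V]]  // stands in for spill files on disk

  def insertAll(records: Iterator[(K, V)]): Unit = {
    for ((k, v) <- records) {
      // merge the new value into the current in-memory map
      currentMap.update(k, currentMap.get(k).map(mergeValue(_, v)).getOrElse(v))
      // "maybeSpill": Spark estimates the map's size in bytes, here we just count entries
      if (currentMap.size > spillThreshold) {
        spills += currentMap.toMap            // "spill" the current map
        currentMap = mutable.Map.empty[K, V]  // continue with a clean map
      }
    }
  }

  // merge the spilled parts with the current map (Spark does a streaming merge instead)
  def result(): Map[K, V] = {
    val all = spills.toSeq :+ currentMap.toMap
    all.flatten.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(mergeValue(_, v)).getOrElse(v))
    }
  }
}

val toy = new ToyAppendOnlyMap[String, Int](_ + _, spillThreshold = 2)
toy.insertAll(Iterator(("a", 1), ("b", 2), ("c", 3), ("a", 4)))
toy.result()  // Map(a -> 5, b -> 2, c -> 3)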

As of Spark 2.0, there is only the sort-based manager: SPARK-14667

Upvotes: 6
