Reputation: 2261
In the Learning Spark book the authors write:
For operations that act on a single RDD, such as reduceByKey(), running on a pre-partitioned RDD will cause all the values for each key to be computed locally on a single machine, requiring only the final, locally reduced value to be sent from each worker node back to the master.
However, in this answer the author says that no pre-partitioning is needed because:
For reduceByKey(), Spark aggregates elements of the same key with the provided associative reduce function locally on each executor first, and then aggregates the partial results across executors.
So, why does the book suggest pre-partitioning if reduceByKey() will aggregate elements on each executor first anyway, without shuffling the data?
Upvotes: 2
Views: 501
Reputation: 88
The answer above pretty much sums up the `reduceByKey` and `partitionBy` methods.
To answer your question: you do not need to apply `partitionBy` before calling `reduceByKey`.
Upvotes: 1
Reputation: 330393
The book doesn't really suggest pre-partitioning. It only describes the behavior of the `*ByKey` methods when applied to an already-partitioned RDD. Considering that partitioning is itself a shuffle, concluding that you should preemptively partition your data for a single `reduceByKey` is unjustified.
In fact, if the data contains N values with K unique keys and P partitions, the size of the shuffle in the `reduceByKey` ∘ `partitionBy` scenario is always greater than or equal to the size of the shuffle with `reduceByKey` alone.
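To see why, here is a small pure-Python simulation (not Spark code; the partition layout, data, and function names are made up for illustration) of map-side combine. `reduceByKey` alone shuffles at most one combined record per (partition, key), while `partitionBy` first moves every single record across the shuffle and still pays a combine shuffle afterwards:

```python
# Illustrative sketch, NOT Spark internals: compare shuffle record counts for
# reduceByKey alone vs. partitionBy followed by reduceByKey.

def reduce_by_key_shuffle(partitions, reduce_fn):
    """Map-side combine: each partition reduces its own keys first, so at most
    one record per (partition, key) crosses the shuffle."""
    shuffled = 0
    locals_per_part = []
    for part in partitions:
        local = {}
        for k, v in part:
            local[k] = reduce_fn(local[k], v) if k in local else v
        locals_per_part.append(local)
        shuffled += len(local)  # one record per distinct key in this partition
    # Merge the locally combined records (the "shuffle read" side).
    result = {}
    for local in locals_per_part:
        for k, v in local.items():
            result[k] = reduce_fn(result[k], v) if k in result else v
    return result, shuffled

def partition_by_shuffle(partitions, num_parts):
    """partitionBy routes every record to its target partition: all N records
    cross the shuffle, with no combining."""
    out = [[] for _ in range(num_parts)]
    shuffled = 0
    for part in partitions:
        for k, v in part:
            out[hash(k) % num_parts].append((k, v))
            shuffled += 1
    return out, shuffled

# Two input partitions, N = 6 records, K = 3 distinct keys.
data = [[("a", 1), ("b", 1), ("a", 1)], [("b", 1), ("c", 1), ("a", 1)]]

res1, s1 = reduce_by_key_shuffle(data, lambda x, y: x + y)
pre, s_pre = partition_by_shuffle(data, 2)
res2, s2 = reduce_by_key_shuffle(pre, lambda x, y: x + y)

print("reduceByKey alone shuffles:", s1)            # 5 records
print("partitionBy + reduceByKey shuffles:", s_pre + s2)  # 6 + 3 = 9 records
```

Both paths produce the same counts per key, but the pre-partitioned path shuffles strictly more records here, matching the inequality above.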
If you're going to apply multiple operations that shuffle by the same key, the amortized cost of a `partitionBy` followed by a set of `*byKey` or `*Join` applications might be lower than the cost of applying the `*byKey` methods alone each time. Similarly, if you've already shuffled the data as part of a different operation and you're going to apply a shuffling operation later, you should try to preserve the existing partitioning. This, however, doesn't imply that you should always prefer `partitionBy` first.
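A back-of-the-envelope model makes the trade-off concrete (illustrative numbers, not Spark measurements; it assumes that once an RDD is co-partitioned by key, subsequent `*byKey` operations with the same partitioner need no further shuffle):

```python
# Rough cost model in shuffled records, using the N/K/P notation from above.
N = 1_000_000  # total records
K = 1_000      # distinct keys
P = 200        # partitions

def cost_without_prepartition(m):
    # Each of the m *byKey operations shuffles at most one combined
    # record per (partition, key): K * P records per operation.
    return m * K * P

def cost_with_prepartition(m):
    # partitionBy shuffles every record once; afterwards the RDD is
    # co-partitioned by key, so the m *byKey operations shuffle nothing.
    return N

for m in (1, 2, 5, 10):
    print(m, cost_without_prepartition(m), cost_with_prepartition(m))
```

With these (made-up) numbers the break-even point is m = 5 operations: below that, `reduceByKey` alone shuffles less; above it, the up-front `partitionBy` amortizes out cheaper. The right choice depends on N, K, P, and how many times the partitioning is reused.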
Upvotes: 0