Reputation: 2261
In the Learning Spark book the authors write:
For operations that act on a single RDD, such as reduceByKey(), running on a pre-partitioned RDD will cause all the values for each key to be computed locally on a single machine, requiring only the final, locally reduced value to be sent from each worker node back to the master.
However, in this answer the author says that no pre-partitioning is needed because:
For reduceByKey(), Spark aggregates elements of the same key with the provided associative reduce function locally on each executor first, and then aggregates the partial results across executors.
So, why does the book suggest pre-partitioning if reduceByKey() will aggregate elements on each executor first anyway, without shuffling the data?
Upvotes: 2
Views: 501
Reputation: 88
The answer above pretty much sums up the `reduceByKey` and `partitionBy` methods.
To answer your question: you do not need to apply `partitionBy` before calling `reduceByKey`.
Upvotes: 1
Reputation: 330393
The book doesn't really suggest pre-partitioning. It only describes the behavior of the `*ByKey` methods when applied to an already-partitioned RDD. Considering that partitioning is itself a shuffle, concluding that you should preemptively partition your data for a single `reduceByKey` is unjustified.
In fact, if the data contains N values with K unique keys and P partitions, the size of the shuffle in the `reduceByKey` ∘ `partitionBy` scenario is always greater than or equal to the size of the shuffle with `reduceByKey` alone.
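To see why, here is a small pure-Python simulation (not Spark code; the partition layout, data, and function names are made up for illustration) of map-side combine. `reduceByKey` alone shuffles at most one combined record per (partition, key), while `partitionBy` first moves every single record across the shuffle and still pays a combine shuffle afterwards:

```python
# Illustrative sketch, NOT Spark internals: compare shuffle record counts for
# reduceByKey alone vs. partitionBy followed by reduceByKey.

def reduce_by_key_shuffle(partitions, reduce_fn):
    """Map-side combine: each partition reduces its own keys first, so at most
    one record per (partition, key) crosses the shuffle."""
    shuffled = 0
    locals_per_part = []
    for part in partitions:
        local = {}
        for k, v in part:
            local[k] = reduce_fn(local[k], v) if k in local else v
        locals_per_part.append(local)
        shuffled += len(local)  # one record per distinct key in this partition
    # Merge the locally combined records (the "shuffle read" side).
    result = {}
    for local in locals_per_part:
        for k, v in local.items():
            result[k] = reduce_fn(result[k], v) if k in result else v
    return result, shuffled

def partition_by_shuffle(partitions, num_parts):
    """partitionBy routes every record to its target partition: all N records
    cross the shuffle, with no combining."""
    out = [[] for _ in range(num_parts)]
    shuffled = 0
    for part in partitions:
        for k, v in part:
            out[hash(k) % num_parts].append((k, v))
            shuffled += 1
    return out, shuffled

# Two input partitions, N = 6 records, K = 3 distinct keys.
data = [[("a", 1), ("b", 1), ("a", 1)], [("b", 1), ("c", 1), ("a", 1)]]

res1, s1 = reduce_by_key_shuffle(data, lambda x, y: x + y)
pre, s_pre = partition_by_shuffle(data, 2)
res2, s2 = reduce_by_key_shuffle(pre, lambda x, y: x + y)

print("reduceByKey alone shuffles:", s1)            # 5 records
print("partitionBy + reduceByKey shuffles:", s_pre + s2)  # 6 + 3 = 9 records
```

Both paths produce the same counts per key, but the pre-partitioned path shuffles strictly more records here, matching the inequality above.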
If you're going to apply multiple operations that shuffle by the same key, the amortized cost of a `partitionBy` followed by a set of `*byKey` or `*Join` applications might be lower than the cost of applying the `*byKey` methods alone each time. Similarly, if you've already shuffled the data as part of a different operation and you're going to apply a shuffling operation later, you should try to preserve the existing partitioning. This, however, doesn't imply that you should always prefer `partitionBy` first.
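A back-of-the-envelope model makes the trade-off concrete (illustrative numbers, not Spark measurements; it assumes that once an RDD is co-partitioned by key, subsequent `*byKey` operations with the same partitioner need no further shuffle):

```python
# Rough cost model in shuffled records, using the N/K/P notation from above.
N = 1_000_000  # total records
K = 1_000      # distinct keys
P = 200        # partitions

def cost_without_prepartition(m):
    # Each of the m *byKey operations shuffles at most one combined
    # record per (partition, key): K * P records per operation.
    return m * K * P

def cost_with_prepartition(m):
    # partitionBy shuffles every record once; afterwards the RDD is
    # co-partitioned by key, so the m *byKey operations shuffle nothing.
    return N

for m in (1, 2, 5, 10):
    print(m, cost_without_prepartition(m), cost_with_prepartition(m))
```

With these (made-up) numbers the break-even point is m = 5 operations: below that, `reduceByKey` alone shuffles less; above it, the up-front `partitionBy` amortizes out cheaper. The right choice depends on N, K, P, and how many times the partitioning is reused.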
Upvotes: 0