Reputation: 48
Suppose I have a PairRDD[(Int, Int)] with 2 partitions, and the data is partitioned as:
P1 => (1, 1), (1, 2), (2, 2)
P2 => (1, 3), (2, 3), (2, 4)
If I run reduceByKey((n1, n2) => n1 + n2) on my PairRDD[(Int, Int)], do I get:
P1 => (1, 3), (2, 2)
P2 => (1, 3), (2, 7)
or:
P1 => (1, 6)
P2 => (2, 9)
In other words, does reduceByKey keep the partitioning of the RDD the same? Or, as in the title, does reduceByKey work partition-wise?
Upvotes: 2
Views: 1096
Reputation: 3374
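reduceByKey will give you the result (1, 6) and (2, 9). It first combines the values for each key within each partition, and then shuffles those partial results so that they are combined globally. So your first listing, with (1, 3), (2, 2) in P1 and (1, 3), (2, 7) in P2, is what the intermediate per-partition state looks like before the shuffle, and your second listing is the final result. Which partition each key ends up in depends on the partitioner.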
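To make the two stages concrete, here is a rough sketch in plain Scala collections that imitates the behaviour on the data from the question. It is not Spark code: the names `ReduceByKeySketch`, `combineLocally` and `shuffleAndMerge` are made up for illustration, and the key-to-partition rule (non-negative `hashCode` modulo the number of partitions) is an assumption modelled on a hash-based partitioner.

```scala
object ReduceByKeySketch {
  // Stage 1: sum the values per key inside a single partition
  // (this imitates the local, pre-shuffle combine).
  def combineLocally(partition: Seq[(Int, Int)]): Map[Int, Int] =
    partition.groupBy(_._1).map { case (k, pairs) => k -> pairs.map(_._2).sum }

  // Stage 2: send every partial result to a target partition chosen by key,
  // then sum the partial results per key (this imitates the shuffle).
  // Assumption: target partition = non-negative (hashCode mod numPartitions).
  def shuffleAndMerge(partials: Seq[Map[Int, Int]], numPartitions: Int): Seq[Map[Int, Int]] = {
    val all = partials.flatMap(_.toSeq)
    (0 until numPartitions).map { p =>
      all
        .filter { case (k, _) => Math.floorMod(k.hashCode, numPartitions) == p }
        .groupBy(_._1)
        .map { case (k, pairs) => k -> pairs.map(_._2).sum }
    }
  }

  def main(args: Array[String]): Unit = {
    val p1 = Seq((1, 1), (1, 2), (2, 2))
    val p2 = Seq((1, 3), (2, 3), (2, 4))

    // Per-partition partial sums: P1 holds 1 -> 3 and 2 -> 2, P2 holds 1 -> 3 and 2 -> 7.
    val partials = Seq(p1, p2).map(combineLocally)
    partials.foreach(println)

    // After the simulated shuffle each key lives in exactly one partition:
    // key 2 with total 9 in partition 0, key 1 with total 6 in partition 1.
    val result = shuffleAndMerge(partials, 2)
    result.foreach(println)
  }
}
```

In a real Spark job you would not write any of this yourself; if you want to see how records are laid out across partitions, calling `glom()` on the RDD before and after `reduceByKey` and collecting the result should show it.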
I am not sure what exactly you are trying to understand, but there is a nice article on shuffling that explains a few important things about how data is shuffled.
You can check that here
Upvotes: 1