yfy

Reputation: 48

Does Apache Spark reduceByKey work partition-wise?

Suppose I have a PairRDD[(Int, Int)] with 2 partitions, and the data is partitioned as:

P1 => (1, 1), (1, 2), (2, 2)

P2 => (1, 3), (2, 3), (2, 4)

If I run reduceByKey((n1, n2) => n1 + n2) on my PairRDD[(Int, Int)]:

Do I get:

P1 => (1, 3), (2, 2)

P2 => (1, 3), (2, 7)

or:

P1 => (1, 6)

P2 => (2, 9)

In other words, does reduceByKey keep the partitioning of the RDD the same? Or, as in the title, does reduceByKey work partition-wise?

Upvotes: 2

Views: 1096

Answers (1)

Srinivasarao Daruna

Reputation: 3374

reduceByKey will give you (1, 6) and (2, 9). reduceByKey first combines values locally within each partition (a map-side combine), and only then shuffles those partial results across the network to combine them globally by key. So the per-partition sums (1, 3), (2, 2) and (1, 3), (2, 7) are produced as an intermediate step, but the final RDD contains one record per key.
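To make the two phases concrete, here is a plain-Python sketch (not Spark itself) of what reduceByKey does with the data from the question: a local combine inside each partition, followed by a global merge of the partial results. The function names reduce_partition and reduce_by_key are my own, chosen for illustration.

```python
def reduce_partition(partition, f):
    # Phase 1 (map-side combine): fold values per key within one partition.
    acc = {}
    for k, v in partition:
        acc[k] = f(acc[k], v) if k in acc else v
    return list(acc.items())

def reduce_by_key(partitions, f):
    # Local combine in each partition first...
    locally_combined = [reduce_partition(p, f) for p in partitions]
    # Phase 2 (after the shuffle): merge the per-partition partial results by key.
    merged = {}
    for part in locally_combined:
        for k, v in part:
            merged[k] = f(merged[k], v) if k in merged else v
    return merged

p1 = [(1, 1), (1, 2), (2, 2)]
p2 = [(1, 3), (2, 3), (2, 4)]

# Intermediate, per-partition results: {1: 3, 2: 2} and {1: 3, 2: 7}
# Final result after the global merge:
print(reduce_by_key([p1, p2], lambda a, b: a + b))  # {1: 6, 2: 9}
```

Only the partial sums cross the shuffle boundary, which is why reduceByKey is usually preferred over groupByKey: far less data is moved between partitions.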

I am not sure exactly what you are trying to understand, but there is a nice article on shuffling that explains a few important things about how Spark moves data between partitions.

You can check that here

Upvotes: 1
