Praveen Kumar B N

Reputation: 141

Reduce operation in Spark with constant values gives a constant result irrespective of input

ser = sc.parallelize([1,2,3,4,5])

freq = ser.reduce(lambda x, y: 1 + 2)
print(freq)  # answer is 3

If I run the reduce operation with constant values, it just gives the sum of those two numbers, so in this case the answer is just 3. I was expecting it to be 3+3+3+3 = 12, since there are 5 elements and the summation should happen 4 times. I'm not able to understand the internals of reduce here. Any help please?

Upvotes: 1

Views: 232

Answers (1)

ernest_k

Reputation: 45339

You're misunderstanding what reduce does. It does not apply an aggregation operation (which you assume to be a sum, for some reason) to a mapping of all elements (which you suppose is what lambda x, y: 1 + 2 produces). The lambda is the aggregation function itself, and it is called repeatedly on pairs of values.

Reducing that RDD will, roughly speaking, do something like this:

call your lambda with 1, 2        -> lambda returns 3
carry 3 and call lambda with 3, 3 -> lambda returns 3
carry 3 and call lambda with 3, 4 -> lambda returns 3
carry 3 and call lambda with 3, 5 -> lambda returns 3

The reduce method returns the last value, which is 3.
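
To make that pairwise folding concrete, here's a rough local illustration using Python's built-in functools.reduce. This is a simplification that assumes a single partition; Spark additionally merges per-partition results, but the pairwise idea is the same:

from functools import reduce

data = [1, 2, 3, 4, 5]

# The constant lambda ignores its arguments, so every call returns 1 + 2 = 3
# and the carried value stays 3.
print(reduce(lambda x, y: 1 + 2, data))  # 3

# An actual sum, for comparison: here the carried value accumulates.
print(reduce(lambda x, y: x + y, data))  # 15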

If your intention is to compute 1 + 2 for each element in the RDD, then you need to map and then reduce, something like:

freq = ser.map(lambda x: 1 + 2).reduce(lambda a, b: a + b)  # see how reduce works
# which you can rewrite as
freq = ser.map(lambda x: 1 + 2).sum()

But the result of this is 15, not 12, because there are 5 elements. I don't know of any operation that computes a mapped value for each "reduction" step and then reduces further.
That is probably the wrong question to ask anyway. You could get 12 with the map & reduce option above by skipping just one element, but I strongly doubt that's what you really want, because the commutative and associative operation passed to reduce can be called an arbitrary number of times depending on how the RDD is partitioned, so counting reduction steps is not something to rely on.
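
For completeness, if you really do want the 12 you expected (one 1 + 2 per pairwise step, i.e. count() - 1 steps), a sketch along the lines above could look like this; treat it as an illustration rather than something to rely on:

# One "1 + 2" per pairwise step means count() - 1 applications:
freq = (ser.count() - 1) * (1 + 2)  # (5 - 1) * 3 = 12

# Or the map & reduce route above, dropping one element's contribution:
freq = ser.map(lambda x: 1 + 2).reduce(lambda a, b: a + b) - (1 + 2)  # 15 - 3 = 12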

Upvotes: 2
