Praveen Kumar B N

Reputation: 141

Reduce operation in Spark with constant values gives a constant result irrespective of input

ser = sc.parallelize([1,2,3,4,5])

freq = ser.reduce(lambda x, y: 1 + 2)
print(freq)  # answer is 3

If I run the reduce operation with constant values, it just gives the sum of those two numbers, so in this case the answer is just 3. I was expecting it to be 3+3+3+3 = 12, since there are 5 elements and the summation should happen 4 times. I'm not able to understand the internals of reduce here. Any help please?

Upvotes: 1

Views: 232

Answers (1)

ernest_k

Reputation: 45339

You're misunderstanding what reduce does. It does not apply an aggregation operation (which you assume to be a sum, for some reason) to a mapping of all elements (which you suppose is what lambda x, y: 1 + 2 produces). The lambda is the aggregation function itself, and it is called repeatedly on pairs of values.

Reducing that RDD will, roughly speaking, do something like this:

call your lambda with 1, 2        -> lambda returns 3
carry 3 and call lambda with 3, 3 -> lambda returns 3
carry 3 and call lambda with 3, 4 -> lambda returns 3
carry 3 and call lambda with 3, 5 -> lambda returns 3

The reduce method returns the last value, which is 3.
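
To make that pairwise folding concrete, here's a rough local illustration using Python's built-in functools.reduce. This is a simplification that assumes a single partition; Spark additionally merges per-partition results, but the pairwise idea is the same:

from functools import reduce

data = [1, 2, 3, 4, 5]

# The constant lambda ignores its arguments, so every call returns 1 + 2 = 3
# and the carried value stays 3.
print(reduce(lambda x, y: 1 + 2, data))  # 3

# An actual sum, for comparison: here the carried value accumulates.
print(reduce(lambda x, y: x + y, data))  # 15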

If your intention is to compute 1 + 2 for each element in the RDD, then you need to map and then reduce, something like:

freq = ser.map(lambda x: 1 + 2).reduce(lambda a, b: a + b)  # see how reduce works
# which you can rewrite as
freq = ser.map(lambda x: 1 + 2).sum()

But the result of this is 15, not 12, because there are 5 elements. I don't know of any operation that computes a mapped value for each "reduction" step and then reduces further.
That is probably the wrong question to ask anyway. You could get 12 with the map & reduce option above by skipping just one element, but I strongly doubt that's what you really want, because the commutative and associative operation passed to reduce can be called an arbitrary number of times depending on how the RDD is partitioned, so counting reduction steps is not something to rely on.
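
For completeness, if you really do want the 12 you expected (one 1 + 2 per pairwise step, i.e. count() - 1 steps), a sketch along the lines above could look like this; treat it as an illustration rather than something to rely on:

# One "1 + 2" per pairwise step means count() - 1 applications:
freq = (ser.count() - 1) * (1 + 2)  # (5 - 1) * 3 = 12

# Or the map & reduce route above, dropping one element's contribution:
freq = ser.map(lambda x: 1 + 2).reduce(lambda a, b: a + b) - (1 + 2)  # 15 - 3 = 12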

Upvotes: 2
