Reputation: 141
ser = sc.parallelize([1,2,3,4,5])
freq = ser.reduce(lambda x,y : 1+2)
print(freq)  # answer is 3
If I run a reduce operation with constant values, it just gives the sum of those 2 numbers, so in this case the answer is 3. I was expecting 12 (3+3+3+3), since there are 5 elements and the summation would happen 4 times. I can't understand the internals of reduce here. Any help, please?
Upvotes: 1
Views: 232
Reputation: 45339
You're misunderstanding what reduce does. It does not apply an aggregation operation (which you assume to be sum for some reason) to a mapping of all elements (which you suppose is what you do with lambda x,y : 1+2).
Reducing that RDD will, roughly speaking, do something like this:
call your lambda with 1, 2 -> lambda returns 3
carry 3 and call lambda with 3, 3 -> lambda returns 3
carry 3 and call lambda with 3, 4 -> lambda returns 3
carry 3 and call lambda with 3, 5 -> lambda returns 3
The reduce method returns the last value, which is 3.
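The steps above can be reproduced locally without Spark. The sketch below uses Python's built-in functools.reduce as a stand-in for the RDD's reduce on a single partition (an assumption; Spark may group the calls differently across partitions), and the names constant_three and traced are hypothetical helpers added only to record each call.

```python
from functools import reduce

def constant_three(x, y):
    # Same body as the question's lambda: both arguments are ignored.
    return 1 + 2

calls = []

def traced(x, y):
    # Hypothetical wrapper that records every (left, right, result) triple.
    result = constant_three(x, y)
    calls.append((x, y, result))
    return result

final = reduce(traced, [1, 2, 3, 4, 5])

for x, y, result in calls:
    print(f"f({x}, {y}) -> {result}")
print("final:", final)
```

The function is called 4 times, but each call only replaces the carried value with 3; nothing accumulates the intermediate results.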
If your intention is to compute 1 + 2 for each element in the RDD, then you need to map and then reduce, something like:
freq = ser.map(lambda x: 1 + 2).reduce(lambda a, b: a + b)  # see how reduce works
# which you can rewrite as
freq = ser.map(lambda x: 1 + 2).sum()
But the result of this is 15, not 12 (as there are 5 elements). I don't know any operation that computes a mapping value for each "reduction" step and allows further reduction.
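To check the arithmetic, here is a plain-Python sketch of the same map-then-reduce idea using an ordinary list instead of an RDD (an assumption made so it runs without a Spark context). The variable names are illustrative only.

```python
from functools import reduce

data = [1, 2, 3, 4, 5]

# Map step: every element becomes 1 + 2, regardless of its value.
mapped = [1 + 2 for _ in data]

# Reduce step: an actual sum over the mapped values.
total = reduce(lambda a, b: a + b, mapped)

# What the question expected: one 3 per reduction step (n - 1 steps).
reduction_steps = len(data) - 1
expected_in_question = reduction_steps * (1 + 2)

print(total, expected_in_question)
```

The difference between 15 and 12 is exactly one element's worth: there are 5 mapped values but only 4 reduction steps.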
That is likely the wrong question to ask, but you could get 12 by using the map & reduce option above and skipping just one element. I strongly doubt this is what you want, though: the function passed to reduce must be commutative and associative, and Spark is free to apply it in whatever order and grouping follows from how the RDD is partitioned.
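Why the function needs to be commutative and associative can be illustrated with a rough local simulation. The helper simulated_reduce below is hypothetical and is not Spark's implementation: it assumes contiguous chunks that are reduced independently and then merged left to right, which is a simplification of what Spark does.

```python
from functools import reduce
from operator import add, sub

def simulated_reduce(data, num_partitions, f):
    # Hypothetical stand-in for a partitioned reduce:
    # split into contiguous chunks, reduce each chunk, then merge the partials.
    size = -(-len(data) // num_partitions)  # ceiling division
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    partials = [reduce(f, part) for part in partitions if part]
    return reduce(f, partials)

data = [1, 2, 3, 4, 5]

# Addition is commutative and associative: the partitioning does not matter.
sums = [simulated_reduce(data, n, add) for n in (1, 2, 3)]

# Subtraction is neither: the result changes with the partitioning.
diffs = [simulated_reduce(data, n, sub) for n in (1, 2, 3)]

print(sums)
print(diffs)
```

With addition every partitioning gives the same total, while with subtraction each partitioning gives a different answer, so any logic that relies on a particular sequence of reduce calls is fragile.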
Upvotes: 2