Reputation: 15
I am trying to get value for below reduceByKey function on RDD, however it is not giving the correct result.
scala> val test =sc.parallelize(( (1 to 5).map(x=>("key",x)))).reduceByKey(_-_).collect
res62: Array[(String, Int)] = Array((key,-3))
Then I tried doing following calculation
scala> List (1,2,3,4,5).reduce(_-_)
res65: Int = -13
Is this happening because there is no guaranty of order in RDD operations and so reduce function is getting applied in any order whereas in case of List order is guaranteed so reduce function is behaving correctly.
Upvotes: 0
Views: 238
Reputation: 23788
This is not a bug but an expected behavior. If you open the doc for reduceByKey you may see (emphasis is mine):
Merge the values for each key using an associative and commutative reduce function.
Those two properties are essential for parallelization:
Associativity means that (a ∗ b) ∗ c = a ∗ (b ∗ c)
(where ∗
is operation)
Commutativity means a ∗ b = b ∗ a
Subtraction is neither associative nor commutative. Thus the result of reduceByKey
is undefined.
Actually even Scala standard library GenTraversable.reduce says (again emphasis is mine)
Reduces the elements of this collection or iterator using the specified associative binary operator.
The order in which operations are performed on elements is unspecified and may be nondeterministic.
So the claim "whereas in case of List order is guaranteed so reduce function is behaving correctly" is also false. The order on List
is an implementation details and theoretically might be changed at any time (although in practice this is not likely to happen because of performance considerations).
Just in case you wonder how -3 can be achieved, here is one possible explanation:
(-1 - -2 - -3) - (-4 - -5)
Upvotes: 3