rgamber

Reputation: 5849

Understanding reduceByKey function definition Spark Scala

The reduceByKey function for a Pair RDD in Spark has the following definition:

def reduceByKey(func: (V, V) => V): RDD[(K, V)]

I understand that reduceByKey takes the argument function and applies it to the values of the keys. What I am trying to understand is how to read this definition, where the function takes two values as input, i.e. (V, V) => V. Shouldn't it be V => V, just like in the mapValues function, where the function is applied to a value of type V to yield a value of type U, which may be the same type or a different one:

def mapValues[U](f: (V) ⇒ U): RDD[(K, U)]

Is this because reduceByKey is applied to all the values (for the same key) at once, while mapValues is applied to each value (irrespective of the key) one at a time? In that case, shouldn't it be defined as something like (V1, V2) => V?

Upvotes: 3

Views: 1566

Answers (2)

gilcu2

Reputation: 353

mapValues transforms the second element of every pair in the RDD by applying f: (V) ⇒ U, while reduceByKey reduces all pairs with the same key to a single pair by applying func: (V, V) => V.

val data = Array((1,1),(1,2),(1,4),(3,5),(3,7))
val rdd = sc.parallelize(data)
rdd.mapValues(x=>x+1).collect
// Array((1,2),(1,3),(1,5),(3,6),(3,8))
rdd.reduceByKey(_+_).collect
// Array((1,7),(3,12))
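
The same behaviour can be imitated without a cluster using plain Scala collections. This is only a sketch: reduceByKeyLocal is a hypothetical helper written for illustration, not part of the Spark API, and it ignores partitioning and shuffling entirely.

```scala
object ReduceByKeySketch {
  // Hypothetical local stand-in for RDD.reduceByKey; NOT Spark API.
  def reduceByKeyLocal[K, V](pairs: Seq[(K, V)])(func: (V, V) => V): Map[K, V] =
    pairs
      .groupBy(_._1) // gather the pairs that share a key
      .map { case (k, kvs) => k -> kvs.map(_._2).reduce(func) } // combine their values two at a time

  def main(args: Array[String]): Unit = {
    val data = Seq((1, 1), (1, 2), (1, 4), (3, 5), (3, 7))
    // The result equals Map(1 -> 7, 3 -> 12); print order of the keys may vary.
    println(reduceByKeyLocal(data)(_ + _))
  }
}
```

Note that func never sees a key: it only receives two values of type V and must hand back one value of type V.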

Upvotes: 1

Alberto Bonsanto

Reputation: 18022

...Shouldn't it be V => V, just like the mapValues...

No, they are totally different. Recall the invariant of map functions: they return a collection (List, Array, etc.) with the same length as the one being mapped. Reduce functions, on the other hand, aggregate or combine all elements; here reduceByKey combines the values that share a key using the supplied function. The shape of this definition comes from a mathematical concept: an associative binary operation, as in a semigroup or a monoid. You can see it this way: the function combines the first two elements of a list, and the result, which must have the same type as those elements, is then combined with the third, and so on, until a single element is left.
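
A minimal sketch of that pairwise combination in plain Scala (no Spark needed). The values are the ones stored under key 1 in the other answer's example data, and the object and value names are illustrative only.

```scala
object PairwiseReduceSketch {
  val values: List[Int] = List(1, 2, 4)
  val add: (Int, Int) => Int = _ + _

  // map keeps the length: one output element per input element.
  val mapped: List[Int] = values.map(_ + 1) // List(2, 3, 5)

  // reduce collapses the list: add(add(1, 2), 4).
  // The intermediate result add(1, 2) is fed back in as an argument,
  // so the function's result type has to be the same as its input type.
  val stepByStep: Int = add(add(values(0), values(1)), values(2)) // 7
  val reduced: Int = values.reduce(add)                            // 7

  def main(args: Array[String]): Unit = {
    println(mapped)
    println(stepByStep == reduced)
  }
}
```

This is why the signature is (V, V) => V rather than (V1, V2) => V: both arguments and the result are values of the one type V. In Spark the values for a key may be combined in a different order across partitions, so the function is expected to be associative and commutative.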

Upvotes: 6
