Reputation: 838
My mapToPair function produces the output below.
(a, 1) (a, 1) (b, 1)
I'm reducing the values using reduceByKey function and the code follows:
private static final Function2<Integer, Integer, Integer> WORDS_REDUCER =
new Function2<Integer, Integer, Integer>() {
public Integer call(Integer a, Integer b) throws Exception {
return a + b;
}
};
It is working fine, can someone explains me how this code works when performing for the pair (b, 1) ?
Upvotes: 0
Views: 551
Reputation: 31745
I am not quite clear from the question what it is that you don't understand, but perhaps this will help...
reduceByKey with the function x+y acts as an accumulator, summing the values per key. If you only have one value for a particular key, that value will be the summed result.
Here's an example using PySpark:
testrdd = sc.parallelize((('a', 1), ('a', 1), ('b', 1)))
testrdd = testrdd.reduceByKey(lambda x,y:x+y)
result = testrdd.collect()
print ("result: {}".format(result))
>result: [('a', 2), ('b', 1)]
Upvotes: 1