Reputation: 31
New to Spark and trying to understand reduceByKey, which is defined to accept an RDD[(K, V)]. What isn't clear to me is how to apply this function when the value is a list/tuple...
After various mapping and filtering operations, my RDD has ended up in the form (Cluster: String, (Unique_ID: String, Count: Int)), in which many elements can belong to the same cluster, e.g.:
Array((a,(lkn,12)), (a,(hdha,2)), (a,(naa,35)), (b, (cdas,20)) ...)
Now I want to use reduceByKey to find, for each cluster, the element with the highest count (so one entry per cluster). In the above example this would be (a,(naa,35)) for cluster a.
I can figure out how to find the maximum per cluster if I have a simple (key, value) pair, using reduceByKey and math.max. But I don't understand how to extend this when the value is itself a tuple.
Am I using the wrong function here?
Upvotes: 2
Views: 2127
You can:
rdd.reduceByKey { case (x, y) => if (x._2 > y._2) x else y }
This:

- logically partitions the data into groups defined by the key (_._1):

  (a, [(lkn,12), (hdha,2), (naa,35), ...])
  (b, [(cdas,20), ...])

- reduces the values within each group by comparing the second element of each pair (x._2 > y._2) and keeps the one with the higher count.
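You can check the reduction locally on a plain Scala collection (a sketch, no SparkContext needed; per key, reduceByKey behaves like a groupBy followed by a pairwise reduce of the values):

```scala
val data = Seq(
  ("a", ("lkn", 12)), ("a", ("hdha", 2)), ("a", ("naa", 35)),
  ("b", ("cdas", 20))
)

// groupBy + reduce mirrors reduceByKey: within each key's group,
// repeatedly keep the (id, count) pair with the larger count.
val maxPerCluster = data
  .groupBy(_._1)
  .map { case (k, vs) =>
    k -> vs.map(_._2).reduce((x, y) => if (x._2 > y._2) x else y)
  }
// maxPerCluster("a") == ("naa", 35); maxPerCluster("b") == ("cdas", 20)
```

Because the reduce function only ever compares two values and returns one of them, it works the same whether Spark applies it within a partition or when merging partial results across partitions (it is associative and commutative, which reduceByKey requires).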
Upvotes: 3