nzn

Reputation: 31

ReduceByKey on specified value element

New to Spark and trying to understand reduceByKey, which is designed to accept RDD[(K, V)]. What isn't clear to me is how to apply this function when the value is a list/tuple...

After various mapping and filtering operations my RDD has ended up in the form of (Cluster:String, (Unique_ID:String, Count:Int)), in which I can have many elements belonging to the same cluster, e.g.:

 Array((a,(lkn,12)), (a,(hdha,2)), (a,(naa,35)), (b, (cdas,20)) ...)

Now I want to use reduceByKey to find, for each cluster, the element with the highest count (so one entry per cluster). In the above example this would be (a,(naa,35)) for cluster a.

I can figure out how to find the maximum per cluster if I have a simple (key, value) pair with reduceByKey and math.max. But I don't understand how to extend this when value represents a list/tuple of values.

Am I using the wrong function here?

Upvotes: 2

Views: 2127

Answers (1)

user6022341

Reputation:

You can:

rdd.reduceByKey { case (x, y) => if (x._2 > y._2) x else y }

This:

  • logically partitions the data into groups defined by the key (_._1):

    • group "a": (a, [(lkn,12), (hdha,2), (naa,35), ...])
    • group "b": (b, [(cdas,20), ...])
  • reduces the values in each group by comparing their second elements (x._2 > y._2) and keeps the one with the higher count.
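To see the same logic without a Spark cluster, here is a minimal sketch using plain Scala collections; groupBy + reduce over the values is the local analogue of reduceByKey, and the sample data is taken from the question.

```scala
object MaxPerCluster {
  def main(args: Array[String]): Unit = {
    // Sample data in the same shape as the question's RDD:
    // (Cluster, (Unique_ID, Count))
    val data = List(
      ("a", ("lkn", 12)), ("a", ("hdha", 2)),
      ("a", ("naa", 35)), ("b", ("cdas", 20))
    )

    // Group by the cluster key, then reduce each group's values with the
    // same comparator used in the reduceByKey answer above.
    val maxPerCluster = data
      .groupBy(_._1)
      .map { case (k, vs) =>
        (k, vs.map(_._2).reduce((x, y) => if (x._2 > y._2) x else y))
      }

    println(maxPerCluster) // Map(a -> (naa,35), b -> (cdas,20))
  }
}
```

On an actual RDD the reduceByKey version is preferable, since it combines values within each partition before shuffling, whereas groupBy-style operations move all values for a key across the network first.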

Upvotes: 3

Related Questions