Vijay Innamuri
Vijay Innamuri

Reputation: 4372

How to find max value in pair RDD?

I have a spark pair RDD (key, count) as below

Array[(String, Int)] = Array((a,1), (b,2), (c,1), (d,3))

How to find the key with highest count using spark scala API?

EDIT: datatype of pair RDD is org.apache.spark.rdd.RDD[(String, Int)]

Upvotes: 15

Views: 49398

Answers (4)

Jacek Laskowski
Jacek Laskowski

Reputation: 74669

Use takeOrdered(1)(Ordering[Int].reverse.on(_._2)):

val a = Array(("a",1), ("b",2), ("c",1), ("d",3))
val rdd = sc.parallelize(a)
val maxKey = rdd.takeOrdered(1)(Ordering[Int].reverse.on(_._2))
// maxKey: Array[(String, Int)] = Array((d,3))

Quoting the note from RDD.takeOrdered:

This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.

Upvotes: 14

Rubber Duck
Rubber Duck

Reputation: 3723

Spark RDD's are more efficient timewise when they are left as RDD's and not turned into Arrays

strinIntTuppleRDD.reduce((x, y) => if(x._2 > y._2) x else y)

Upvotes: 5

Mayank
Mayank

Reputation: 303

For Pyspark:

Let a be the pair RDD with keys as String and values as integers then

a.max(lambda x:x[1])

returns the key value pair with the maximum value. Basically the max function orders by the return value of the lambda function.

Here a is a pair RDD with elements such as ('key',int) and x[1] just refers to the integer part of the element.

Note that the max function by itself will order by key and return the max value.

Documentation is available at https://spark.apache.org/docs/1.5.0/api/python/pyspark.html#pyspark.RDD.max

Upvotes: 11

Sergii Lagutin
Sergii Lagutin

Reputation: 10671

Use Array.maxBy method:

val a = Array(("a",1), ("b",2), ("c",1), ("d",3))
val maxKey = a.maxBy(_._2)
// maxKey: (String, Int) = (d,3)

or RDD.max:

val maxKey2 = rdd.max()(new Ordering[Tuple2[String, Int]]() {
  override def compare(x: (String, Int), y: (String, Int)): Int = 
      Ordering[Int].compare(x._2, y._2)
})

Upvotes: 24

Related Questions