Reputation: 6347

Finding the max value in Spark RDD

From the following, how can I get the tuple with the highest value?

Array[(String, Int)] = Array((a,30),(b,50),(c,20))

In this example the result I want would be (b,50)

Upvotes: 3

Answers (5)

Haroun Mohammedi

Reputation: 2434

If you are new to spark, I should tell you that you have to use Dataframes as much as possible, they have a lot of advantages comparing with RDDs, with Dataframes you can get the max like this:

import spark.implicits._
import org.apache.spark.sql.functions.max
val df = Seq(("a",30),("b",50),("c",20)).toDF("x", "y")
val x = df.sort($"y".desc).first()

Disclaimer: as @Mandy007 noted in the comments, this solution is more computationally expensive speaking because it must be ordered

This should work, it works for me at least. hope this helps you.

Upvotes: 2

vamshi palutla

Reputation: 104

rdd.reduceByKey((a,b)=>a+b).collect.maxBy(_._2)

we can use maxBy on collect like this

Upvotes: 0

StanislavKo

Reputation: 420

reduce() returns wrong result for me. There are some other options:

val maxTemp2 = rdd.max()(Ordering[Int].on(x=>x._2))
val maxTemp3 = rdd.sortBy[Int](x=>x._2).take(1)(0)

Data

val rdd = sc.parallelize(Array(("a",30),("b",50),("c",20)))

Upvotes: 1

mtoto

Reputation: 24198

You could use reduce():

val max_tuple = rdd.reduce((acc,value) => { 
  if(acc._2 < value._2) value else acc})
//max_tuple: (String, Int) = (b,50)

Data

val rdd = sc.parallelize(Array(("a",30),("b",50),("c",20)))

Upvotes: 7

Dani

Reputation: 1022

If the elements are always tuples of two elements you could simply:

Array((a,30),(b,50),(c,20)).maxBy(_._2)

As specified in the docs.

Upvotes: 2

Finding the max value in Spark RDD

Answers (5)

Related Questions