Reputation: 6347
From the following, how can I get the tuple with the highest value?
Array[(String, Int)] = Array((a,30),(b,50),(c,20))
In this example the result I want would be (b,50).
Upvotes: 3
Views: 15421
Reputation: 2434
If you are new to Spark, I should tell you that you should use DataFrames as much as possible; they have a lot of advantages compared with RDDs. With DataFrames you can get the max like this:
import spark.implicits._
import org.apache.spark.sql.functions.max
val df = Seq(("a",30),("b",50),("c",20)).toDF("x", "y")
val x = df.sort($"y".desc).first()
Disclaimer: as @Mandy007 noted in the comments, this solution is computationally more expensive because it has to sort the whole DataFrame.
This should work; at least it works for me. Hope this helps.
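If the full sort is a concern, one possible alternative (just a sketch, not from this answer; the name best is my own) is to aggregate with max over a struct whose first field is the value, so Spark only tracks a single running maximum:
import org.apache.spark.sql.functions.{max, struct}
// max over a struct compares field by field, so putting y first gives the row with the largest y
val best = df.agg(max(struct($"y", $"x"))).first().getStruct(0)
// best: org.apache.spark.sql.Row = [50,b]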
Upvotes: 2
Reputation: 104
We can collect and use maxBy like this:
// aggregate values per key, then take the element with the largest value on the driver
rdd.reduceByKey((a, b) => a + b).collect.maxBy(_._2)
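Note that collect pulls the whole aggregated RDD to the driver first. A possible variant (just a sketch, not part of the original answer; best is my own name) keeps the max computation on the cluster:
// aggregate per key on the executors, then take the max by value without collecting everything
val best = rdd.reduceByKey(_ + _).max()(Ordering[Int].on(x => x._2))
// best: (String, Int) = (b,50)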
Upvotes: 0
Reputation: 420
reduce() returns the wrong result for me. There are some other options:
val maxTemp2 = rdd.max()(Ordering[Int].on(x => x._2))
// sortBy is ascending by default, so sort descending and take the first element
val maxTemp3 = rdd.sortBy[Int](x => x._2, ascending = false).take(1)(0)
Data
val rdd = sc.parallelize(Array(("a",30),("b",50),("c",20)))
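Another option (a sketch added here, not from the original answer; maxTemp4 is my own name) is top, which returns the largest elements without sorting the whole RDD:
// top(1) with an ordering on the value returns the single largest tuple
val maxTemp4 = rdd.top(1)(Ordering[Int].on(x => x._2)).head
// maxTemp4: (String, Int) = (b,50)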
Upvotes: 1
Reputation: 24198
You could use reduce():
val max_tuple = rdd.reduce((acc, value) =>
  if (acc._2 < value._2) value else acc)
// max_tuple: (String, Int) = (b,50)
Data
val rdd = sc.parallelize(Array(("a",30),("b",50),("c",20)))
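One caveat: reduce throws an exception on an empty RDD. A possible guard (just a sketch, not from the original answer; max_tuple_safe is my own name) is fold with a sentinel that can never win the comparison:
// Int.MinValue is neutral for this max, so the zero value is safe across partitions
val max_tuple_safe = rdd.fold(("", Int.MinValue))((acc, value) =>
  if (acc._2 < value._2) value else acc)
// max_tuple_safe: (String, Int) = (b,50)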
Upvotes: 7