blankface
blankface

Reputation: 6347

Finding the max value in Spark RDD

From the following, how can I get the tuple with the highest value?

Array[(String, Int)] = Array((a,30),(b,50),(c,20))

In this example the result I want would be (b,50)

Upvotes: 3

Views: 15421

Answers (5)

Haroun Mohammedi
Haroun Mohammedi

Reputation: 2434

If you are new to spark, I should tell you that you have to use Dataframes as much as possible, they have a lot of advantages comparing with RDDs, with Dataframes you can get the max like this:

import spark.implicits._
import org.apache.spark.sql.functions.max
val df = Seq(("a",30),("b",50),("c",20)).toDF("x", "y")
val x = df.sort($"y".desc).first()

Disclaimer: as @Mandy007 noted in the comments, this solution is more computationally expensive speaking because it must be ordered

This should work, it works for me at least. hope this helps you.

Upvotes: 2

vamshi palutla
vamshi palutla

Reputation: 104

rdd.reduceByKey((a,b)=>a+b).collect.maxBy(_._2)

we can use maxBy on collect like this

Upvotes: 0

StanislavKo
StanislavKo

Reputation: 420

reduce() returns wrong result for me. There are some other options:

val maxTemp2 = rdd.max()(Ordering[Int].on(x=>x._2))
val maxTemp3 = rdd.sortBy[Int](x=>x._2).take(1)(0)

Data

val rdd = sc.parallelize(Array(("a",30),("b",50),("c",20)))

Upvotes: 1

mtoto
mtoto

Reputation: 24198

You could use reduce():

val max_tuple = rdd.reduce((acc,value) => { 
  if(acc._2 < value._2) value else acc})
//max_tuple: (String, Int) = (b,50)

Data

val rdd = sc.parallelize(Array(("a",30),("b",50),("c",20)))

Upvotes: 7

Dani
Dani

Reputation: 1022

If the elements are always tuples of two elements you could simply:

Array((a,30),(b,50),(c,20)).maxBy(_._2)

As specified in the docs.

Upvotes: 2

Related Questions