Ajay

Reputation: 795

Scala - Groupby and Max on pair RDD

I am new to Spark with Scala and want to find the max salary in each department.

Dept,Salary
Dept1,1000
Dept2,2000
Dept1,2500
Dept2,1500
Dept1,1700
Dept2,2800

I implemented the code below:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf


object MaxSalary {
  val sc = new SparkContext(new SparkConf().setAppName("Max Salary").setMaster("local[2]"))

  case class Dept(dept_name: String, Salary: Int)

  val data = sc.textFile("file:///home/user/Documents/dept.txt").map(_.split(","))
  val recs = data.map(r => (r(0), Dept(r(0), r(1).toInt)))
  val a = ??? // stuck here: how do I group by department and take the max salary?
}

but I am stuck on how to implement the group-by and max. I am using a pair RDD.

Thanks

Upvotes: 0

Views: 5580

Answers (2)

philantrovert

Reputation: 10082

This can be done using RDDs with the following code:

val emp = sc.textFile("file:///home/user/Documents/dept.txt")
            .mapPartitionsWithIndex((idx, rows) => if (idx == 0) rows.drop(1) else rows) // drop the header line
            .map(x => (x.split(",")(0), x.split(",")(1).toInt))

val maxSal = emp.reduceByKey(math.max(_, _)) // keep the larger salary per department

Should give you:

Array[(String, Int)] = Array((Dept1,2500), (Dept2,2800))
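As a side note, reduceByKey is preferable to groupByKey().mapValues(_.max) here because it combines values map-side before the shuffle. If you need the whole record rather than just the salary, here is a minimal sketch reusing the (dept, Dept) pair RDD recs built in the question:

// Sketch: keep the entire Dept record with the highest salary per department,
// assuming the pair RDD `recs: RDD[(String, Dept)]` from the question.
val maxRec = recs.reduceByKey((a, b) => if (a.Salary >= b.Salary) a else b)
maxRec.values.collect() // e.g. Array(Dept(Dept1,2500), Dept(Dept2,2800))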

Upvotes: 6

koiralo

Reputation: 23109

If you are using Datasets, here is a solution:

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.max

case class Dept(dept_name: String, Salary: Int)

val sc = new SparkContext(new SparkConf().setAppName("Max Salary").setMaster("local[2]"))
val sq = new SQLContext(sc)
import sq.implicits._

val file = "resources/ip.csv"

val data = sc.textFile(file).map(_.split(","))
val recs = data.map(r => Dept(r(0), r(1).toInt)).toDS()

recs.groupBy($"dept_name").agg(max("Salary").alias("max_solution")).show()

Output:

+---------+------------+
|dept_name|max_solution|
+---------+------------+
|    Dept2|        2800|
|    Dept1|        2500|
+---------+------------+
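On newer Spark versions (2.0+), SQLContext is superseded by SparkSession. An equivalent sketch follows; the builder settings and CSV read options are illustrative assumptions, not part of the answer above:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

val spark = SparkSession.builder()
  .appName("Max Salary")
  .master("local[2]")
  .getOrCreate()
import spark.implicits._

// Read the sample file with its header row; the column names become Dept and Salary.
val recs = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("file:///home/user/Documents/dept.txt")

recs.groupBy($"Dept").agg(max($"Salary").alias("max_salary")).show()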

Upvotes: 1
