Reputation: 19
I have two cases classes as below case class EmployeeDetails(id:Long, empName:String, dept:String) case class SalDetails(salary:Long, dept:String)
and created two dataframes out of them and did a average of salary for each department
val spark = SparkSession
.builder
.appName("Emp")
.master("local[*]")
.getOrCreate()
import spark.implicits._
val empDetails=Seq(
EmployeeDetails(1,"nachiket","IT"),
EmployeeDetails(2,"sanketa","Admin"),
EmployeeDetails(3,"kedar","IT")).toDF()
val salaryDetails=Seq(SalDetails(120000,"IT"),
SalDetails(35000,"Admin"),
SalDetails(300000,"IT")).toDF()
val commonFields=salaryDetails.join(empDetails,"dept").orderBy("salary")
val sortedFields= commonFields.groupBy("dept").avg("salary")
sortedFields.show()
output is similar to below.. so far so gud
+-----+-----------+
| dept|avg(salary)|
+-----+-----------+
|Admin| 35000.0|
| IT| 210000.0|
+-----+-----------+
as you can see average is calculated for 2 IT department employees and 1 Admin department employees. Along with the above output I need to show another column say "count" with output 1 and 2 for each row
Upvotes: 1
Views: 44
Reputation: 340
val sortedFields= commonFields.groupBy("dept").agg(avg("salary"),countDistinct("name"))
will give you the desired result.
However I see a problem with the above logic, since there are multiple entries for 'IT' department in SalDetails, this will yield incorrect results for commonFields. Not sure if that is intended. Rather you can think of having employee_id and salary in the SalDetails and join the two dataframes based on employee ids.
Upvotes: 1