nachiket26
nachiket26

Reputation: 19

spark dataframe collecting specific results

I have two cases classes as below case class EmployeeDetails(id:Long, empName:String, dept:String) case class SalDetails(salary:Long, dept:String)

and created two dataframes out of them and did a average of salary for each department

val spark = SparkSession
  .builder
  .appName("Emp")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

val empDetails=Seq(
  EmployeeDetails(1,"nachiket","IT"),
  EmployeeDetails(2,"sanketa","Admin"),
EmployeeDetails(3,"kedar","IT")).toDF()

val salaryDetails=Seq(SalDetails(120000,"IT"),
  SalDetails(35000,"Admin"),
  SalDetails(300000,"IT")).toDF()

  val commonFields=salaryDetails.join(empDetails,"dept").orderBy("salary")
  val sortedFields= commonFields.groupBy("dept").avg("salary")
  sortedFields.show()

output is similar to below.. so far so gud

+-----+-----------+
| dept|avg(salary)|
+-----+-----------+
|Admin|    35000.0|
|   IT|   210000.0|
+-----+-----------+

as you can see average is calculated for 2 IT department employees and 1 Admin department employees. Along with the above output I need to show another column say "count" with output 1 and 2 for each row

Upvotes: 1

Views: 44

Answers (1)

linusRian
linusRian

Reputation: 340

val sortedFields= commonFields.groupBy("dept").agg(avg("salary"),countDistinct("name"))

will give you the desired result.

However I see a problem with the above logic, since there are multiple entries for 'IT' department in SalDetails, this will yield incorrect results for commonFields. Not sure if that is intended. Rather you can think of having employee_id and salary in the SalDetails and join the two dataframes based on employee ids.

Upvotes: 1

Related Questions