abc_spark

Reputation: 383

CountDistinct based on a condition in Spark Scala

I have the following dataframe:

+-------+---+----+
|Company|EMP|Flag|
+-------+---+----+
|      M| c1|   Y|
|      M| c1|   Y|
|      M| c2|   N|
|      M| c2|   N|
|      M| c3|   Y|
|      M| c3|   Y|
|      M| c4|   N|
|      M| c4|   N|
|      M| c5|   Y|
|      M| c5|   Y|
|      M| c6|   Y|
+-------+---+----+

Created by:

import spark.implicits._

val df1 = Seq(
  ("M","c1","Y"),
  ("M","c1","Y"),
  ("M","c2","N"),
  ("M","c2","N"),
  ("M","c3","Y"),
  ("M","c3","Y"),
  ("M","c4","N"),
  ("M","c4","N"),
  ("M","c5","Y"),
  ("M","c5","Y"),
  ("M","c6","Y")
).toDF("Company","EMP","Flag")

How can I take the distinct count of EMP when Flag = Y and when Flag = N? Once an EMP has got a flag, it won't change again. I can achieve this with distinct, but is there any way to achieve it without distinct, to avoid an extra join in the code? (A sketch of my current distinct-based approach is below the expected output.)

Expected Output:

+-------+---+---+----------+----------+
|Company|  Y|  N|Total_ROWs|Unique_Emp|
+-------+---+---+----------+----------+
|      M|  4|  2|        11|         6|
+-------+---+---+----------+----------+
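
For context, the distinct-based version I'd like to replace looks roughly like this (the val names here are just for illustration). The total row count has to come from the un-deduplicated frame, hence the extra join:

import org.apache.spark.sql.functions._
import spark.implicits._

// distinct (Company, EMP, Flag) rows, then count flags per company
val flagCounts = df1.select("Company", "EMP", "Flag").distinct()
  .groupBy("Company")
  .agg(
    count(when($"Flag" === "Y", 1)).as("Y"),
    count(when($"Flag" === "N", 1)).as("N"),
    count("EMP").as("Unique_Emp"))

// row totals need the original frame, so a join is unavoidable here
val totals = df1.groupBy("Company").agg(count("*").as("Total_ROWs"))
flagCounts.join(totals, Seq("Company")).show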

Upvotes: 0

Views: 483

Answers (1)

Lamanus

Reputation: 13551

What about this?

import org.apache.spark.sql.functions._
import spark.implicits._

df1.groupBy("Company", "EMP", "Flag")
   .agg(count("Company").as("Row"))                    // rows per (EMP, Flag) pair
   .groupBy("Company", "EMP", "Flag")
   .agg(count("Flag").as("YN"), sum("Row").as("Row"))  // one row per distinct EMP
   .groupBy("Company")
   .agg(count(when($"Flag" === "Y", 1)).as("Y"),       // distinct EMPs flagged Y
        count(when($"Flag" === "N", 1)).as("N"),       // distinct EMPs flagged N
        sum("Row").as("Total_ROWs"),                   // total input rows
        count("EMP").as("Unique_EMP"))                 // distinct EMPs overall
   .show

+-------+---+---+----------+----------+
|Company|  Y|  N|Total_ROWs|Unique_EMP|
+-------+---+---+----------+----------+
|      M|  4|  2|        11|         6|
+-------+---+---+----------+----------+
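
If countDistinct itself is acceptable (it is a single aggregate expression, not a separate .distinct() plus a join), the same numbers can likely be computed in one pass. A minimal sketch, relying on the fact that when(...) without otherwise yields null for non-matching rows and countDistinct skips nulls:

import org.apache.spark.sql.functions._
import spark.implicits._

df1.groupBy("Company")
   .agg(
     countDistinct(when($"Flag" === "Y", $"EMP")).as("Y"),  // non-Y rows become null and are skipped
     countDistinct(when($"Flag" === "N", $"EMP")).as("N"),
     count("*").as("Total_ROWs"),                           // all input rows
     countDistinct("EMP").as("Unique_Emp"))                 // distinct EMPs overall
   .show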

Upvotes: 1
