User9102d82

Reputation: 1190

Count distinct column values for a given set of columns

Imagine a dataframe as shown below:

Id    Date        FileType
1     2020JAN01   HTML
1     2020JAN01   TXT
2     2020FEB01   HTML
2     2020FEB02   HTML
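
For reference, the sample dataframe can be built like so (a minimal sketch, assuming a spark-shell or a SparkSession named spark):

import spark.implicits._

// Build the sample data shown above as a DataFrame.
val df = Seq(
  (1, "2020JAN01", "HTML"),
  (1, "2020JAN01", "TXT"),
  (2, "2020FEB01", "HTML"),
  (2, "2020FEB02", "HTML")
).toDF("Id", "Date", "FileType")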

How can I create a column (let's say TotalTypes) that counts the distinct values of "FileType" for each combination of "Id" and "Date"? The output should look like the table below.

Id    Date        FileType    TotalTypes
1     2020JAN01   HTML        2
1     2020JAN01   TXT         2
2     2020FEB01   HTML        1
2     2020FEB02   HTML        1

I want to do this with Spark/Scala but haven't been able to work out the syntax. Any pointers are much appreciated.

Upvotes: 0

Views: 177

Answers (2)

Samir Vyas

Reputation: 451

This could work

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_set, size}
import spark.implicits._

// Collect the distinct FileType values within each (Id, Date) window,
// then count them with size. Note that .over must wrap the aggregate
// itself (collect_set), not the size(...) call.
val window = Window.partitionBy($"Id", $"Date")

df.withColumn("TotalTypes", size(collect_set($"FileType").over(window)))
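
Worth noting: Spark does not support count(distinct ...) as a window function, which is why collect_set plus size is the usual workaround here. For comparison, a non-window alternative (a sketch, assuming the same df) is to aggregate the distinct counts and join them back:

import org.apache.spark.sql.functions.countDistinct

// Compute the distinct FileType count per (Id, Date), then join it
// back onto the original rows.
val counts = df.groupBy("Id", "Date")
  .agg(countDistinct("FileType").as("TotalTypes"))

df.join(counts, Seq("Id", "Date")).show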

Upvotes: 1

Lamanus

Reputation: 13541

Using a Window with rank, each distinct "FileType" gets a rank within its (Id, Date) partition; taking the max of that rank over the same partition gives the distinct count.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{max, rank}

// Rank the FileType values within each (Id, Date) partition, then take
// the maximum rank over the same partition as the distinct count.
val w1 = Window.partitionBy("Id", "Date").orderBy("FileType")
val w2 = Window.partitionBy("Id", "Date")

df.withColumn("TotalTypes", max(rank().over(w1)).over(w2)).show

+---+---------+--------+----------+
| Id|     Date|FileType|TotalTypes|
+---+---------+--------+----------+
|  1|2020JAN01|    HTML|         2|
|  1|2020JAN01|     TXT|         2|
|  2|2020FEB01|    HTML|         1|
|  2|2020FEB02|    HTML|         1|
+---+---------+--------+----------+
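
One caveat (an assumption beyond the sample data, which has no duplicate rows): rank skips values on ties, so if the same (Id, Date, FileType) combination can appear more than once, max(rank) will overcount. dense_rank (also in org.apache.spark.sql.functions) assigns consecutive ranks and avoids this:

// dense_rank gives distinct FileType values consecutive ranks, so
// max(dense_rank) equals the distinct count even with duplicate rows.
df.withColumn("TotalTypes", max(dense_rank().over(w1)).over(w2)).show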

Upvotes: 1
