Reputation: 1190
Imagine a dataframe as shown below:
Id  Date       FileType
1   2020JAN01  HTML
1   2020JAN01  TXT
2   2020FEB01  HTML
2   2020FEB02  HTML
How can I create a column (let's say TotalTypes) that counts the distinct values of FileType for a given pair of Id and Date? The output should look as shown below.
Id  Date       FileType  TotalTypes
1   2020JAN01  HTML      2
1   2020JAN01  TXT       2
2   2020FEB01  HTML      1
2   2020FEB02  HTML      1
I want to do this with Spark/Scala but have not been able to work out the syntax. Any pointers are much appreciated.
Upvotes: 0
Views: 177
Reputation: 451
This could work:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_set, size}
import spark.implicits._

val window = Window.partitionBy($"Id", $"Date")
// collect_set gathers the distinct FileType values per (Id, Date) partition and
// size counts them; note that .over must wrap the aggregate itself (collect_set),
// not size(...), otherwise Spark rejects the expression as unsupported in a window.
df.withColumn("TotalTypes", size(collect_set($"FileType").over(window)))
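For reference, the sample frame can be rebuilt from the question's data like this (a minimal sketch, assuming an active SparkSession named spark so the implicits import resolves):

// Hypothetical reconstruction of the question's dataframe for testing
val df = Seq(
  (1, "2020JAN01", "HTML"),
  (1, "2020JAN01", "TXT"),
  (2, "2020FEB01", "HTML"),
  (2, "2020FEB02", "HTML")
).toDF("Id", "Date", "FileType")

Applying the snippet above to it yields TotalTypes = 2 for both 2020JAN01 rows of Id 1 and 1 for the remaining rows, matching the desired output.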
Upvotes: 1
Reputation: 13541
By using Window and dense_rank you can number the distinct FileType values within each (Id, Date) group and take the max as the count. (dense_rank is used rather than rank here because rank skips numbers on ties, which would overcount if the same FileType appeared twice in a group.)
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{dense_rank, max}

val w1 = Window.partitionBy("Id", "Date").orderBy("FileType")
val w2 = Window.partitionBy("Id", "Date")
// max of dense_rank per (Id, Date) group = number of distinct FileType values
df.withColumn("TotalTypes", max(dense_rank().over(w1)).over(w2)).show
+---+---------+--------+----------+
| Id| Date|FileType|TotalTypes|
+---+---------+--------+----------+
| 1|2020JAN01| HTML| 2|
| 1|2020JAN01| TXT| 2|
| 2|2020FEB01| HTML| 1|
| 2|2020FEB02| HTML| 1|
+---+---------+--------+----------+
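To see the mechanics, you can materialize the intermediate rank (the column name rk here is just for illustration):

// show the per-row dense rank before taking the max
df.withColumn("rk", dense_rank().over(w1)).show

For Id 1, 2020JAN01, HTML gets rk 1 and TXT gets rk 2, so the max over w2 is 2, i.e. the number of distinct file types.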
Upvotes: 1