Reputation: 1190
Imagine a dataframe as shown below:
Id  Date       FileType
1   2020JAN01  HTML
1   2020JAN01  TXT
2   2020FEB01  HTML
2   2020FEB02  HTML
How can I create a column (let's say TotalTypes) that counts the distinct values of FileType for a given pair of Id and Date? The output should look as shown below.
Id  Date       FileType  TotalTypes
1   2020JAN01  HTML      2
1   2020JAN01  TXT       2
2   2020FEB01  HTML      1
2   2020FEB02  HTML      1
I want to do this with Spark/Scala but have not been able to work out the syntax. Any pointers are much appreciated.
Upvotes: 0
Views: 177
Reputation: 451
This could work:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_set, size}
import spark.implicits._

val window = Window.partitionBy($"Id", $"Date")
// collect_set gathers the distinct FileType values per (Id, Date) partition and
// size counts them; note that .over must wrap the aggregate itself (collect_set),
// not size(...), otherwise Spark rejects the expression as unsupported in a window.
df.withColumn("TotalTypes", size(collect_set($"FileType").over(window)))
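For reference, the sample frame can be rebuilt from the question's data like this (a minimal sketch, assuming an active SparkSession named spark so the implicits import resolves):

// Hypothetical reconstruction of the question's dataframe for testing
val df = Seq(
  (1, "2020JAN01", "HTML"),
  (1, "2020JAN01", "TXT"),
  (2, "2020FEB01", "HTML"),
  (2, "2020FEB02", "HTML")
).toDF("Id", "Date", "FileType")

Applying the snippet above to it yields TotalTypes = 2 for both 2020JAN01 rows of Id 1 and 1 for the remaining rows, matching the desired output.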
Upvotes: 1
Reputation: 13541
By using Window and dense_rank you can number the distinct FileType values within each (Id, Date) group and take the max as the count. (dense_rank is used rather than rank here because rank skips numbers on ties, which would overcount if the same FileType appeared twice in a group.)
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{dense_rank, max}

val w1 = Window.partitionBy("Id", "Date").orderBy("FileType")
val w2 = Window.partitionBy("Id", "Date")
// max of dense_rank per (Id, Date) group = number of distinct FileType values
df.withColumn("TotalTypes", max(dense_rank().over(w1)).over(w2)).show
+---+---------+--------+----------+
| Id| Date|FileType|TotalTypes|
+---+---------+--------+----------+
| 1|2020JAN01| HTML| 2|
| 1|2020JAN01| TXT| 2|
| 2|2020FEB01| HTML| 1|
| 2|2020FEB02| HTML| 1|
+---+---------+--------+----------+
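To see the mechanics, you can materialize the intermediate rank (the column name rk here is just for illustration):

// show the per-row dense rank before taking the max
df.withColumn("rk", dense_rank().over(w1)).show

For Id 1, 2020JAN01, HTML gets rk 1 and TXT gets rk 2, so the max over w2 is 2, i.e. the number of distinct file types.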
Upvotes: 1