Spark: GroupBy and collect_list while filtering by another column

Question

I have the following dataframe

+-----+-----+------+
|group|label|active|
+-----+-----+------+
|    a|    1|     y|
|    a|    2|     y|
|    a|    1|     n|
|    b|    1|     y|
|    b|    1|     n|
+-----+-----+------+

I would like to group by "group" column and collect by "label" column, meanwhile filtering on the value in column active.

The expected result would be

+-----+---------+---------+----------+
|group| labelyes| labelno |difference|
+-----+---------+---------+----------+
|a    | [1,2]   | [1]     | [2]      |
|b    | [1]     | [1]     | []       |
+-----+---------+---------+----------+

I could get easily filter for "y" label by

val dfyes = df.filter($"active" === "y").groupBy("group").agg(collect_set("label"))

and similarly for the "n" value

val dfno = df.filter($"active" === "n").groupBy("group").agg(collect_set("label"))

but I don't understand if it's possible to aggregate simultaneously while filtering and how to get the difference of the two sets.

Galuoises · Accepted Answer

Thanks @mck for his help. I have found an alternative way to solve the question, namely to filter with when during the aggregation:

df
   .groupBy("group")
   .agg(
        collect_set(when($"active" === "y", $"label")).as("labelyes"), 
        collect_set(when($"active" === "n", $"label")).as("labelno")
       )
.withColumn("diff", array_except($"labelyes", $"labelno"))

Spark: GroupBy and collect_list while filtering by another column

Answers (2)

Related Questions