Galuoises
Galuoises

Reputation: 3283

Spark: GroupBy and collect_list while filtering by another column

I have the following dataframe

+-----+-----+------+
|group|label|active|
+-----+-----+------+
|    a|    1|     y|
|    a|    2|     y|
|    a|    1|     n|
|    b|    1|     y|
|    b|    1|     n|
+-----+-----+------+

I would like to group by "group" column and collect by "label" column, meanwhile filtering on the value in column active.

The expected result would be

+-----+---------+---------+----------+
|group| labelyes| labelno |difference|
+-----+---------+---------+----------+
|a    | [1,2]   | [1]     | [2]      |
|b    | [1]     | [1]     | []       |
+-----+---------+---------+----------+

I could get easily filter for "y" label by

val dfyes = df.filter($"active" === "y").groupBy("group").agg(collect_set("label"))

and similarly for the "n" value

val dfno = df.filter($"active" === "n").groupBy("group").agg(collect_set("label"))

but I don't understand if it's possible to aggregate simultaneously while filtering and how to get the difference of the two sets.

Upvotes: 2

Views: 1941

Answers (2)

Galuoises
Galuoises

Reputation: 3283

Thanks @mck for his help. I have found an alternative way to solve the question, namely to filter with when during the aggregation:

df
   .groupBy("group")
   .agg(
        collect_set(when($"active" === "y", $"label")).as("labelyes"), 
        collect_set(when($"active" === "n", $"label")).as("labelno")
       )
.withColumn("diff", array_except($"labelyes", $"labelno"))

Upvotes: 3

mck
mck

Reputation: 42352

You can do a pivot, and use some array functions to get the difference:

val df2 = df.groupBy("group").pivot("active").agg(collect_list("label")).withColumn(
    "difference", 
    array_union(
        array_except(col("n"), col("y")), 
        array_except(col("y"), col("n"))
    )
)

df2.show
+-----+---+------+----------+
|group|  n|     y|difference|
+-----+---+------+----------+
|    b|[1]|   [1]|        []|
|    a|[1]|[1, 2]|       [2]|
+-----+---+------+----------+

Upvotes: 2

Related Questions