Reputation: 3283
I have the following dataframe
+-----+-----+------+
|group|label|active|
+-----+-----+------+
| a| 1| y|
| a| 2| y|
| a| 1| n|
| b| 1| y|
| b| 1| n|
+-----+-----+------+
I would like to group by "group" column and collect by "label" column, meanwhile filtering on the value in column active.
The expected result would be
+-----+---------+---------+----------+
|group| labelyes| labelno |difference|
+-----+---------+---------+----------+
|a | [1,2] | [1] | [2] |
|b | [1] | [1] | [] |
+-----+---------+---------+----------+
I could get easily filter for "y" label by
val dfyes = df.filter($"active" === "y").groupBy("group").agg(collect_set("label"))
and similarly for the "n" value
val dfno = df.filter($"active" === "n").groupBy("group").agg(collect_set("label"))
but I don't understand if it's possible to aggregate simultaneously while filtering and how to get the difference of the two sets.
Upvotes: 2
Views: 1941
Reputation: 3283
Thanks @mck for his help. I have found an alternative way to solve the question, namely to filter with when
during the aggregation:
df
.groupBy("group")
.agg(
collect_set(when($"active" === "y", $"label")).as("labelyes"),
collect_set(when($"active" === "n", $"label")).as("labelno")
)
.withColumn("diff", array_except($"labelyes", $"labelno"))
Upvotes: 3
Reputation: 42352
You can do a pivot, and use some array functions to get the difference:
val df2 = df.groupBy("group").pivot("active").agg(collect_list("label")).withColumn(
"difference",
array_union(
array_except(col("n"), col("y")),
array_except(col("y"), col("n"))
)
)
df2.show
+-----+---+------+----------+
|group| n| y|difference|
+-----+---+------+----------+
| b|[1]| [1]| []|
| a|[1]|[1, 2]| [2]|
+-----+---+------+----------+
Upvotes: 2