Reputation: 2160
I have a dataset ds
like this:
ds.show():
id1 | id2 | id3 | value |
1 | 1 | 2 | tom |
1 | 1 | 2 | tim |
1 | 3 | 2 | tom |
1 | 3 | 2 | tom |
2 | 1 | 2 | mary |
I want to remove all duplicate lines (note: not the same as distinct(), I do not want to still have a distinct line, but to remove both lines) per keys (id1,id2,id3)
, the expected output is:
id1 | id2 | id3 | value |
1 | 3 | 2 | tom |
2 | 1 | 2 | mary |
here I should remove line 1 and line 2 because we have 2 values for the key group.
I try to achieve this using:
ds.groupBy(id1,id2,id3).distinct()
But it's not working.
Upvotes: 1
Views: 97
Reputation: 23109
You can use window
function with filter
on count
as below
val df = Seq(
(1, 1, 2, "tom"),
(1, 1, 2, "tim"),
(1, 3, 2, "tom"),
(2, 1, 2, "mary")
).toDF("id1", "id2", "id3", "value")
val window = Window.partitionBy("id1", "id2", "id3")
df.withColumn("count", count("value").over(window))
.filter($"count" < 2)
.drop("count")
.show(false)
Output:
+---+---+---+-----+
|id1|id2|id3|value|
+---+---+---+-----+
|1 |3 |2 |tom |
|2 |1 |2 |mary |
+---+---+---+-----+
Upvotes: 3