hawarden_

Reputation: 2160

Spark: remove all duplicated lines

I have a dataset ds like this:

ds.show():

id1 | id2 | id3 | value  |
1   | 1   | 2   | tom    |
1   | 1   | 2   | tim    |
1   | 3   | 2   | tom    |
1   | 3   | 2   | tom    |
2   | 1   | 2   | mary   |

I want to remove all duplicate lines per key (id1, id2, id3). Note that this is not the same as distinct(): I do not want to keep one distinct line, I want to remove both lines. The expected output is:

id1 | id2 | id3 | value  |
1   | 3   | 2   | tom    |
2   | 1   | 2   | mary   |

Here, lines 1 and 2 should be removed because there are two different values for the same key group.

I tried to achieve this with:

ds.groupBy("id1", "id2", "id3").distinct()

But it's not working.

Upvotes: 1

Views: 97

Answers (1)

koiralo

Reputation: 23109

You can use a window function and filter on the count, as shown below:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count
import spark.implicits._

val df = Seq(
  (1, 1, 2, "tom"),
  (1, 1, 2, "tim"),
  (1, 3, 2, "tom"),
  (2, 1, 2, "mary")
).toDF("id1", "id2", "id3", "value")

// Partition by the key columns so the count is computed per key group
val window = Window.partitionBy("id1", "id2", "id3")

df.withColumn("count", count("value").over(window))
  .filter($"count" < 2) // keep only key groups with a single row
  .drop("count")
  .show(false)

Output:

+---+---+---+-----+
|id1|id2|id3|value|
+---+---+---+-----+
|1  |3  |2  |tom  |
|2  |1  |2  |mary |
+---+---+---+-----+
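
One caveat: the sample data above leaves out the second (1, 3, 2, tom) row from the question. With that exact duplicate included, the count for the (1, 3, 2) group becomes 2 and both copies get filtered out, which is not the expected output. A minimal sketch of one way to handle it, assuming exact duplicates should collapse to a single row before counting, and reusing the imports and window definition from the snippet above:

// Sketch: the full dataset from the question, including the exact duplicate row
val ds = Seq(
  (1, 1, 2, "tom"),
  (1, 1, 2, "tim"),
  (1, 3, 2, "tom"),
  (1, 3, 2, "tom"),
  (2, 1, 2, "mary")
).toDF("id1", "id2", "id3", "value")

// Collapse exact duplicates first, then drop key groups that still
// contain more than one row (i.e. conflicting values)
ds.distinct()
  .withColumn("count", count("value").over(window))
  .filter($"count" < 2)
  .drop("count")
  .show(false)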

Upvotes: 3
