Neil

Reputation: 35

Keep only rows with duplicated values from a dataframe column

I'm learning Spark with Scala. I have a dataframe composed of two columns.

col1  col2
a     1
b     1
b     2
c     1
c     3
b     4
d     5

I would like to delete all the rows for which the value in col2 is present only once (values 2, 3, 4 and 5). Basically, what I'm looking for is to do the opposite of dropDuplicates.
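For reference, the example data can be built like this (a minimal sketch, assuming a SparkSession is in scope as spark):

import spark.implicits._

val df = Seq(
  ("a", 1), ("b", 1), ("b", 2),
  ("c", 1), ("c", 3), ("b", 4), ("d", 5)
).toDF("col1", "col2")

The expected result would keep only the rows whose col2 value appears more than once:

col1  col2
a     1
b     1
c     1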

Upvotes: 2

Views: 1702

Answers (2)

Raphael Roth

Reputation: 27373

You can calculate the rows to remove using groupBy and then do a left anti join to filter them out:

import org.apache.spark.sql.functions.count
import spark.implicits._

df.join(
  df.groupBy($"col2")
    .agg(count($"col2").as("count"))  // occurrences of each col2 value
    .where($"count" === 1),           // keep only the values that occur once
  Seq("col2"),
  "leftanti"                          // drop rows whose col2 matches the right side
)
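A left anti join keeps only the rows of the left side that have no match on the right, so the rows whose col2 value occurs exactly once are dropped.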

Alternatively, using window functions:

import org.apache.spark.sql.expressions.Window

df
  .withColumn("count", count($"col2").over(Window.partitionBy($"col2")))  // size of each col2 group
  .where($"count" > 1)
  .drop($"count")

Upvotes: 3

Tzach Zohar

Reputation: 37832

Here is one way using window functions. The idea is to use a window ordered by col2 and check adjacent records: if the previous or next record has the same col2 value, keep the record:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val window = Window.orderBy("col2")

val result = df
  .withColumn("prev", lag($"col2", 1).over(window))   // col2 of the previous row
  .withColumn("next", lead($"col2", 1).over(window))  // col2 of the next row
  .where($"prev" === $"col2" or $"next" === $"col2")  // keep rows with a matching neighbour
  .drop("prev", "next")

Upvotes: 1
