Neil

Reputation: 35

Keep only rows with duplicated values from a dataframe column

I'm learning Spark with Scala. I have a dataframe composed of two columns.

col1  col2
a     1
b     1
b     2
c     1
c     3
b     4
d     5

I would like to delete all the rows for which the value in col2 is present only once (values 2, 3, 4 and 5). Basically, what I'm looking for is to do the opposite of dropDuplicates.
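For reference, the example data can be built like this (a minimal sketch, assuming a SparkSession is in scope as spark):

import spark.implicits._

val df = Seq(
  ("a", 1), ("b", 1), ("b", 2),
  ("c", 1), ("c", 3), ("b", 4), ("d", 5)
).toDF("col1", "col2")

The expected result would keep only the rows whose col2 value appears more than once:

col1  col2
a     1
b     1
c     1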

Upvotes: 2

Views: 1702

Answers (2)

Raphael Roth

Reputation: 27373

You can calculate the rows to remove using groupBy and then do a left anti join to filter them out:

import org.apache.spark.sql.functions.count
import spark.implicits._

df.join(
  df.groupBy($"col2")
    .agg(count($"col2").as("count"))  // occurrences of each col2 value
    .where($"count" === 1),           // keep only the values that occur once
  Seq("col2"),
  "leftanti"                          // drop rows whose col2 matches the right side
)
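A left anti join keeps only the rows of the left side that have no match on the right, so the rows whose col2 value occurs exactly once are dropped.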

Alternatively, using window functions:

import org.apache.spark.sql.expressions.Window

df
  .withColumn("count", count($"col2").over(Window.partitionBy($"col2")))  // size of each col2 group
  .where($"count" > 1)
  .drop($"count")

Upvotes: 3

Tzach Zohar

Reputation: 37832

Here is one way using window functions. The idea is to use a window ordered by col2 and check adjacent records: if the previous or next record has the same col2 value, keep the record:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val window = Window.orderBy("col2")

val result = df
  .withColumn("prev", lag($"col2", 1).over(window))   // col2 of the previous row
  .withColumn("next", lead($"col2", 1).over(window))  // col2 of the next row
  .where($"prev" === $"col2" or $"next" === $"col2")  // keep rows with a matching neighbour
  .drop("prev", "next")

Upvotes: 1
