Reputation: 35
I'm learning Spark with Scala. I have a DataFrame composed of two columns:
col1 col2
a    1
b    1
b    2
c    1
c    3
b    4
d    5
I would like to delete all the rows for which the value in col2 is present only once (2, 3, 4 and 5). Basically, what I'm looking for is the opposite of dropDuplicates.
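To reproduce, here is a minimal sketch of that DataFrame (assuming a SparkSession named spark is in scope):

import spark.implicits._

val df = Seq(
  ("a", 1), ("b", 1), ("b", 2),
  ("c", 1), ("c", 3), ("b", 4),
  ("d", 5)
).toDF("col1", "col2")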
Upvotes: 2
Views: 1702
Reputation: 27373
You can calculate the rows to remove using groupBy and then do a left anti join to filter them out:
import org.apache.spark.sql.functions.count
import spark.implicits._

df.join(
  df.groupBy($"col2")
    .agg(count($"col2").as("count"))  // occurrences of each col2 value
    .where($"count" === 1),           // col2 values that appear exactly once
  Seq("col2"),
  "leftanti"                          // remove rows matching those values
)
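On the sample data this leaves just (a, 1), (b, 1) and (c, 1), since 1 is the only col2 value that occurs more than once.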
Or alternatively, using window functions:
import org.apache.spark.sql.expressions.Window

df
  .withColumn("count", count($"col2").over(Window.partitionBy($"col2")))
  .where($"count" > 1)
  .drop($"count")
Upvotes: 3
Reputation: 37832
Here is one way which involves window functions. The idea is to use a Window ordered by col2 and check adjacent records: if the previous or next record has the same col2 value, keep the record:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val window = Window.orderBy("col2")
val result = df
  .withColumn("prev", lag($"col2", 1).over(window))   // col2 of the previous row
  .withColumn("next", lead($"col2", 1).over(window))  // col2 of the next row
  .where($"prev" === $"col2" or $"next" === $"col2")  // a neighbour shares the value
  .drop("prev", "next")
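Note that Window.orderBy without a partitionBy moves all the data to a single partition (Spark logs a warning to that effect), so this works on small data but will not scale the way the partitionBy variant in the other answer does.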
Upvotes: 1