Reputation: 1480
I have a dataset say:
a=1, b=2
a=2, b=2
a=2, b=3
...and I want to drop records where a has the same value but b has a different value. In this case, both records where a=2 should be dropped.
I suspect I need to groupBy on a and then apply some kind of filtering where b != b.
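Roughly what I have in mind, assuming a Scala spark-shell session with a DataFrame df holding the columns a and b, is to count the distinct values of b per a and keep only the groups with exactly one, though I'm not sure this is the right approach:

import org.apache.spark.sql.functions.{col, countDistinct}

// groupBy on a, keep the values of a that only ever see one b,
// then join back to recover the surviving rows
val singleB = df.groupBy("a")
  .agg(countDistinct("b").as("b_count"))
  .where(col("b_count") === 1)
  .select("a")

val kept = df.join(singleB, Seq("a"))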
Upvotes: 0
Views: 343
Reputation: 4069
My solution is in Scala, but the idea carries over:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val df = sc.parallelize(Seq(
  (1, 2),
  (2, 2),
  (2, 3)
)).toDF("a", "b")

// Count the rows in each partition of a; keep only the values of a
// that occur exactly once, then drop the helper column.
val window = Window.partitionBy("a")

val newDF = (df
  .withColumn("count", count(lit(1)).over(window))
  .where(col("count") === lit(1))
  .drop("count"))
newDF.show
Output:
+---+---+
| a| b|
+---+---+
| 1| 2|
+---+---+
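One thing to be aware of: count over the window counts rows per value of a, so any a that appears more than once is dropped even when all of its b values are identical. If you would rather keep exact duplicates and only drop the groups where b actually disagrees, a possible variant along the same lines (a sketch, not tested beyond this toy data) is to compare the minimum and maximum of b within the window:

val sameB = (df
  .withColumn("min_b", min(col("b")).over(window))
  .withColumn("max_b", max(col("b")).over(window))
  // keep a group only if every b in it agrees
  .where(col("min_b") === col("max_b"))
  .drop("min_b", "max_b"))

sameB.show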
Upvotes: 1