In Apache Spark Java how can I remove elements from a dataset where some field does not match

Question

I have a dataset say:

a=1, b=2
a=2, b=2
a=2, b=3

...and I want to drop records where a has the same value but b has a different value. In this case dropping both records where a=2.

I suspect I need to groupBy for a then some kind of filtering where b != b.

Kafels · Accepted Answer

I did my solution using scala, just follow the idea:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.{Window}

val df = sc.parallelize(Seq(
  (1, 2),
  (2, 2),
  (2, 3)
)).toDF("a", "b")

val window = Window.partitionBy("a")
val newDF = (df
             .withColumn("count", count(lit(1)).over(window))
             .where(col("count") === lit(1))
             .drop("count"))

newDF.show

Output:

+---+---+
|  a|  b|
+---+---+
|  1|  2|
+---+---+

In Apache Spark Java how can I remove elements from a dataset where some field does not match

Answers (1)

Related Questions