How to remove a record from Spark DataSet

Question

I am creating a Dataset like this:

SparkSession spark = JavaSparkSessionSingleton.getInstance(javaStreamingContext.sparkContext().getConf());
Dataset<Row> journyDF = spark.createDataFrame(journyDataJavaRDD, JournyData.class);

"journyDF" has a column "longitude". If the value of that column is 0, I want to remove that row from "journyDF" (that is, skip the row in further processing).

Is there a method which can do that?

DavidW · Accepted Answer

The simplest approach would appear to be Dataset.filter(), which in Java accepts a SQL condition string, so something like:

Dataset<Row> journyDF = spark.createDataFrame(journyDataJavaRDD, JournyData.class).filter("longitude != 0");

or perhaps, using a Column expression (with a static import of org.apache.spark.sql.functions.col):

[...].filter(col("longitude").notEqual(0));

(You don't specify the type of the column, so you may need to adjust this.)
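
For reference, here is a minimal self-contained sketch of the Column-based variant. It is not the asker's actual pipeline: the class name FilterZeroLongitude, the helper names dropZeroLongitude and demoCount, the "id" column, the sample coordinates, and the assumption that "longitude" is a double are all made up for illustration. It builds a three-row DataFrame on a local SparkSession and keeps only the rows whose longitude is not 0.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.col;

// Hypothetical demo class; all names here are illustrative only.
class FilterZeroLongitude {

    // Returns a new Dataset containing only the rows whose "longitude" is not 0.
    // Datasets are immutable, so the input Dataset itself is left unchanged.
    static Dataset<Row> dropZeroLongitude(Dataset<Row> df) {
        return df.filter(col("longitude").notEqual(0));
    }

    // Builds a small sample DataFrame, filters it, and returns the remaining row count.
    static long demoCount() {
        SparkSession spark = SparkSession.builder()
                .master("local[1]")
                .appName("filter-zero-longitude")
                .getOrCreate();

        // Assumed schema: an integer id plus a double longitude.
        StructType schema = new StructType()
                .add("id", DataTypes.IntegerType)
                .add("longitude", DataTypes.DoubleType);

        List<Row> rows = Arrays.asList(
                RowFactory.create(1, 77.59),
                RowFactory.create(2, 0.0),      // this row should be filtered out
                RowFactory.create(3, -122.42));

        Dataset<Row> journeys = spark.createDataFrame(rows, schema);
        long remaining = dropZeroLongitude(journeys).count();

        spark.stop();
        return remaining;
    }

    public static void main(String[] args) {
        System.out.println(demoCount());
    }
}
```

Note that filter() does not modify the original Dataset; it returns a new one, so the result has to be assigned to a variable. Also, rows where "longitude" is null do not satisfy the condition either, so they are dropped along with the zeros; if those should be kept, the condition needs an explicit isNull() check combined with or().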
