Reputation: 11
I have a Spark DataFrame and I want to select a few rows/records from it based on a matching value in a particular column. I guess I can do this using a filter operation, or a select inside a map transformation.
However, I also want to update a status column for the rows/records that were not selected by the filter.
Applying the filter operation returns a new DataFrame consisting only of the matching records.
So, how do I identify and update the column value of the rows that were not selected?
Upvotes: 1
Views: 1032
Reputation: 1912
On applying the filter operation, you get a new DataFrame consisting of the matching records.
Then, you can use the except function in Scala to get the non-matching records from the input DataFrame.
scala> val inputDF = Seq(("a", 1),("b", 2), ("c", 3), ("d", 4), ("e", 5)).toDF("id", "count")
inputDF: org.apache.spark.sql.DataFrame = [id: string, count: int]
scala> val filterDF = inputDF.filter($"count" > 3)
filterDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: string, count: int]
scala> filterDF.show()
+---+-----+
| id|count|
+---+-----+
| d| 4|
| e| 5|
+---+-----+
scala> val unmatchDF = inputDF.except(filterDF)
unmatchDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: string, count: int]
scala> unmatchDF.show()
+---+-----+
| id|count|
+---+-----+
| b| 2|
| a| 1|
| c| 3|
+---+-----+
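To then update a status column against the non-matching rows (the original question), one approach is to tag each half with a literal value and union them back together. A minimal sketch, where "status", "matched", and "unmatched" are placeholder names:

scala> import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.functions.lit

scala> // mark the matching and non-matching halves, then recombine
scala> val updatedDF = filterDF.withColumn("status", lit("matched"))
     |   .union(unmatchDF.withColumn("status", lit("unmatched")))

Since DataFrames are immutable, updatedDF is a new DataFrame rather than an in-place update of inputDF. Alternatively, the same result can be computed in a single pass with when/otherwise on the filter condition, avoiding the filter + except split entirely.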
In PySpark you can achieve the same with the subtract function.
Upvotes: 1