Reputation: 624
I want to delete specific records from a Spark dataframe:
Sample Input:
I have written the below code to filter the dataframe(Which is incorrect):
val Name = List("Rahul","Mahesh","Gaurav")
val Age =List(20,55)
val final_pub_df = df.filter(!col("Name").isin(Name:_*) && !col("Age").isin(Age:_*))
So my question is - How to filter the dataframe for more than one column with specific filter criteria. The dataframe should be filtered on the basis of the combination of Name and Age fields.
Upvotes: 1
Views: 4753
Reputation: 691
Here's the solution. Based on your dataset I formulated problem -
below dataframe has incorrect entries. I want to remove all incorrect records and keep only correct records -
val Friends = Seq(
("Rahul", "99", "AA"),
("Rahul", "20", "BB"),
("Rahul", "30", "BB"),
("Mahesh", "55", "CC"),
("Mahesh", "88", "DD"),
("Mahesh", "44", "FF"),
("Ramu", "30", "FF"),
("Gaurav", "99", "PP"),
("Gaurav", "20", "HH")).toDF("Name", "Age", "City")
Arrays for filtering -
val Name = List("Rahul", "Mahesh", "Gaurav")
val IncorrectAge = List(20, 55)
Dataops -
Friends.filter(!(col("Name").isin(Name: _*) && col("Age").isin(IncorrectAge: _*))).show
Here's the output -
+------+---+----+
| Name|Age|City|
+------+---+----+
| Rahul| 99| AA|
| Rahul| 30| BB|
|Mahesh| 88| DD|
|Mahesh| 44| FF|
| Ramu| 30| FF|
|Gaurav| 99| PP|
+------+---+----+
You can also do it with help of joins ..
Create a Badrecords df -
val badrecords = Friends.filter(col("Name").isin(Name: _*) && col("Age").isin(IncorrectAge: _*))
User left_anti join to select Friends minus badrecords -
Friends.alias("left").join(badrecords.alias("right"), Seq("Name", "Age"), "left_anti").show
Here's the output -
+------+---+----+
| Name|Age|City|
+------+---+----+
| Rahul| 99| AA|
| Rahul| 30| BB|
|Mahesh| 88| DD|
|Mahesh| 44| FF|
| Ramu| 30| FF|
|Gaurav| 99| PP|
+------+---+----+
Upvotes: 2
Reputation: 74
I think you may want to flip the not condition .... filter in dataframe is an alias to where clause in sql.
So you want the query to be
df.filter(col("Name").isin(Name:_*) && col("Age").isin(Age:_*))
Upvotes: -1