Stark
Stark

Reputation: 624

How to delete/filter the specific rows from a spark dataframe

I want to delete specific records from a Spark dataframe:
Sample Input:
Sample Input

Expected output:
Expected output

Discarded Rows:
Discarded Rows

I have written the below code to filter the dataframe(Which is incorrect):


val Name = List("Rahul","Mahesh","Gaurav")
val Age =List(20,55)

val final_pub_df = df.filter(!col("Name").isin(Name:_*) &&  !col("Age").isin(Age:_*))

So my question is - How to filter the dataframe for more than one column with specific filter criteria. The dataframe should be filtered on the basis of the combination of Name and Age fields.

Upvotes: 1

Views: 4753

Answers (2)

ValaravausBlack
ValaravausBlack

Reputation: 691

Here's the solution. Based on your dataset I formulated problem -

below dataframe has incorrect entries. I want to remove all incorrect records and keep only correct records -

val Friends = Seq(
      ("Rahul", "99", "AA"),
      ("Rahul", "20", "BB"),
      ("Rahul", "30", "BB"),
      ("Mahesh", "55", "CC"),
      ("Mahesh", "88", "DD"),
      ("Mahesh", "44", "FF"),
      ("Ramu", "30", "FF"),
      ("Gaurav", "99", "PP"),
      ("Gaurav", "20", "HH")).toDF("Name", "Age", "City")

Arrays for filtering -

val Name = List("Rahul", "Mahesh", "Gaurav")
val IncorrectAge = List(20, 55)

Dataops -

Friends.filter(!(col("Name").isin(Name: _*) && col("Age").isin(IncorrectAge: _*))).show

Here's the output -

+------+---+----+
|  Name|Age|City|
+------+---+----+
| Rahul| 99|  AA|
| Rahul| 30|  BB|
|Mahesh| 88|  DD|
|Mahesh| 44|  FF|
|  Ramu| 30|  FF|
|Gaurav| 99|  PP|
+------+---+----+

You can also do it with help of joins ..

Create a Badrecords df -

val badrecords = Friends.filter(col("Name").isin(Name: _*) && col("Age").isin(IncorrectAge: _*))

User left_anti join to select Friends minus badrecords -

 Friends.alias("left").join(badrecords.alias("right"), Seq("Name", "Age"), "left_anti").show

Here's the output -

+------+---+----+
|  Name|Age|City|
+------+---+----+
| Rahul| 99|  AA|
| Rahul| 30|  BB|
|Mahesh| 88|  DD|
|Mahesh| 44|  FF|
|  Ramu| 30|  FF|
|Gaurav| 99|  PP|
+------+---+----+

Upvotes: 2

Pranav Sawant
Pranav Sawant

Reputation: 74

I think you may want to flip the not condition .... filter in dataframe is an alias to where clause in sql.

So you want the query to be

df.filter(col("Name").isin(Name:_*) && col("Age").isin(Age:_*))

Upvotes: -1

Related Questions