pmanresa93

Reputation: 393

Remove rows from Spark DataFrame that ONLY satisfy two conditions

I am using Scala and Spark. I want to remove from a DataFrame only the rows that satisfy ALL of the conditions I specify, while keeping the rows that satisfy just some of them.

For example: let's say I have this DataFrame

+-------+----+
|country|date|
+-------+----+
|      A|   1|
|      A|   2|
|      A|   3|
|      B|   1|
|      B|   2|
|      B|   3|
+-------+----+

and I want to filter out the rows where the country is A and the date is 1 or 2, so the expected output is:

+-------+----+
|country|date|
+-------+----+
|      A|   3|
|      B|   1|
|      B|   2|
|      B|   3|
+-------+----+

As you can see, I am still keeping country B with dates 1 and 2.

I tried to use filter in the following way

df.filter("country != 'A' and date not in (1,2)")

But this filters out all rows with dates 1 and 2 regardless of country, which is not what I want.

Thanks.

Upvotes: 3

Views: 5788

Answers (1)

DNA

Reputation: 42586

Your current condition is

df.filter("country != 'A' and date not in (1,2)")

which can be read as "accept any country other than A, and also accept any date except 1 or 2". The two conditions are applied independently, so a row is kept only when both hold.

What you want is:

df.filter("not (country = 'A' and date in (1,2))")

i.e. "Find the rows with country A and date of 1 or 2, and reject them"

or equivalently:

df.filter("country != 'A' or date not in (1,2)")

i.e. "If country isn't A, then accept it regardless of the date. If the country is A, then the date mustn't be 1 or 2"

See De Morgan's laws:

not(A or B) = not A and not B

not (A and B) = not A or not B
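To see that the two rewritten filters really are equivalent, here is a minimal sketch on a plain Scala collection (not a Spark DataFrame, so it runs without Spark); the `rows`, `kept1` and `kept2` names are just illustrative, and the data mirrors the table in the question.

```scala
// Sample data mirroring the DataFrame in the question: (country, date)
val rows = Seq(
  ("A", 1), ("A", 2), ("A", 3),
  ("B", 1), ("B", 2), ("B", 3)
)

// Equivalent of: not (country = 'A' and date in (1,2))
val kept1 = rows.filter { case (country, date) =>
  !(country == "A" && Set(1, 2).contains(date))
}

// Equivalent of: country != 'A' or date not in (1,2)
// (the De Morgan rewrite of the condition above)
val kept2 = rows.filter { case (country, date) =>
  country != "A" || !Set(1, 2).contains(date)
}
```

Both filters keep (A,3), (B,1), (B,2) and (B,3), matching the expected output.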

Upvotes: 5
