Reputation: 393
I am using Scala and Spark. I want to filter out rows from a DataFrame only when they satisfy ALL of the conditions I specify, while keeping rows that satisfy just one of the conditions.
For example, let's say I have this DataFrame:
+-------+----+
|country|date|
+-------+----+
| A| 1|
| A| 2|
| A| 3|
| B| 1|
| B| 2|
| B| 3|
+-------+----+
and I want to filter out country A with dates 1 and 2, so that the expected output should be:
+-------+----+
|country|date|
+-------+----+
| A| 3|
| B| 1|
| B| 2|
| B| 3|
+-------+----+
As you can see, I am still keeping country B with dates 1 and 2.
I tried to use filter in the following way:
df.filter("country != 'A' and date not in (1,2)")
But this filters out all rows with dates 1 and 2, regardless of country, which is not what I want.
Thanks.
Upvotes: 3
Views: 5788
Reputation: 42586
Your current condition is
df.filter("country != 'A' and date not in (1,2)")
which can be translated as "accept any country other than A, and accept any date except 1 or 2". The two conditions are applied independently, so a row is kept only if it passes both.
What you want is:
df.filter("not (country = 'A' and date in (1,2))")
i.e. "Find the rows with country A and date of 1 or 2, and reject them"
or equivalently:
df.filter("country != 'A' or date not in (1,2)")
i.e. "If country isn't A, then accept it regardless of the date. If the country is A, then the date mustn't be 1 or 2"
See De Morgan's laws:
not (A or B) = (not A) and (not B)
not (A and B) = (not A) or (not B)
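As a quick sanity check that the two formulations agree, a sketch comparing them on the df from above:

val rejected = df.filter("not (country = 'A' and date in (1,2))")
val distributed = df.filter("country != 'A' or date not in (1,2)")

// Both directions of except() are empty, so the results are identical
assert(rejected.except(distributed).count() == 0)
assert(distributed.except(rejected).count() == 0)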
Upvotes: 5