Wojciech Walczak
Wojciech Walczak

Reputation: 3599

pandas: multiple conditions while indexing data frame - unexpected behavior

I am filtering rows in a dataframe by values in two columns.

For some reason the OR operator behaves like I would expect AND operator to behave and vice versa.

My test code:

df = pd.DataFrame({'a': range(5), 'b': range(5) })

# let's insert some -1 values
df['a'][1] = -1
df['b'][1] = -1
df['a'][3] = -1
df['b'][4] = -1

df1 = df[(df.a != -1) & (df.b != -1)]
df2 = df[(df.a != -1) | (df.b != -1)]

print(pd.concat([df, df1, df2], axis=1,
                keys = [ 'original df', 'using AND (&)', 'using OR (|)',]))

And the result:

      original df      using AND (&)      using OR (|)    
             a  b              a   b             a   b
0            0  0              0   0             0   0
1           -1 -1            NaN NaN           NaN NaN
2            2  2              2   2             2   2
3           -1  3            NaN NaN            -1   3
4            4 -1            NaN NaN             4  -1

[5 rows x 6 columns]

As you can see, the AND operator drops every row in which at least one value equals -1. On the other hand, the OR operator requires both values to be equal to -1 to drop them. I would expect exactly the opposite result. Could anyone explain this behavior?

I am using pandas 0.13.1.

Upvotes: 256

Views: 703917

Answers (5)

cottontail
cottontail

Reputation: 23151

Use parenthesis

If you arrived at this page because the filtering operation didn't give the correct answer even though the conditions are logically correct, then the first thing to check is whether you used parenthesis to separate conditions.

For example, if you wanted to filter out rows where the values in columns 'a' and 'b' are not equal to -1, then writing the following code

df[df['a'] != -1 & df['b'] != -1]      # <--- forgot parenthesis

would produce a completely unexpected output simply because &/| have higher precedence than comparison operators like !=/== etc. You can get the correct output by evaluating each condition separately via parentheses:

df[(df['a'] != -1) & (df['b'] != -1)]  # <--- used parentheses

N.B. @Pedro's answer which uses query() eliminates this need because in the numerical expression evaluated in query, comparison operators are in fact evaluated before and/or etc.


Writing correct logical expressions

By de Morgan's laws, (i) the negation of a union is the intersection of the negations, and (ii) the negation of an intersection is the union of the negations, i.e.,

A AND B <=> not A OR not B
A OR B  <=> not A AND not B

If the aim is to

drop every row in which at least one value equals -1

you can either use AND operator to identify the rows to keep or use OR operator to identify the rows to drop.

# select rows where both a and b values are not equal to -1
df2_0 = df[df['a'].ne(-1) & df['b'].ne(-1)]

# index of rows where at least one of a or b equals -1
idx = df.index[df.eval('a == -1 or b == -1')]
# drop `idx` rows
df2_1 = df.drop(idx)

df2_0.equals(df2_1) # True

On the other hand, if the aim is to

drop every row in which both values equal -1

you do the exact opposite; either use OR operator to identify the rows to keep or use AND operator to identify the rows to drop.

Upvotes: 3

Pedro Lobito
Pedro Lobito

Reputation: 98921

Late answer, but you can also use query(), e.g. :

df_filtered = df.query('a == 4 & b != 2')

Upvotes: 79

Mohammad Rahimi
Mohammad Rahimi

Reputation: 59

You can try the following:

df1 = df[(df['a'] != -1) & (df['b'] != -1)]       

Upvotes: 1

Jake
Jake

Reputation: 1670

A little mathematical logic theory here:

"NOT a AND NOT b" is the same as "NOT (a OR b)", so:

"a NOT -1 AND b NOT -1" is equivalent of "NOT (a is -1 OR b is -1)", which is opposite (Complement) of "(a is -1 OR b is -1)".

So if you want exact opposite result, df1 and df2 should be as below:

df1 = df[(df.a != -1) & (df.b != -1)]
df2 = df[(df.a == -1) | (df.b == -1)]

Upvotes: 23

DSM
DSM

Reputation: 353099

As you can see, the AND operator drops every row in which at least one value equals -1. On the other hand, the OR operator requires both values to be equal to -1 to drop them.

That's right. Remember that you're writing the condition in terms of what you want to keep, not in terms of what you want to drop. For df1:

df1 = df[(df.a != -1) & (df.b != -1)]

You're saying "keep the rows in which df.a isn't -1 and df.b isn't -1", which is the same as dropping every row in which at least one value is -1.

For df2:

df2 = df[(df.a != -1) | (df.b != -1)]

You're saying "keep the rows in which either df.a or df.b is not -1", which is the same as dropping rows where both values are -1.

PS: chained access like df['a'][1] = -1 can get you into trouble. It's better to get into the habit of using .loc and .iloc.

Upvotes: 396

Related Questions