Tom

Reputation: 23

Pandas Boolean Indexing Issue

Can anyone explain the following behaviour? I am expecting all three rows to be returned.

import pandas as pd

test_dict = {
    'col1':[None, None, None],
    'col2':[True, False, True],
    'col3':[True, True, False]
}

df = pd.DataFrame(test_dict)

df[ df.col1 | df.col2 | df.col3 ]
# Returns only the first two rows (index 0 and 1)

Replacing the None values with empty strings using df.fillna('') appears to fix it, but I don't understand why the first two rows work fine if None is the issue.

Also, changing the order of the operands affects the result. If I swap col2 and col3 in the mask, the row with index 1 is no longer returned but the row with index 2 is. If col1 comes last, all rows are returned.

Upvotes: 2

Views: 192

Answers (1)

Quang Hoang

Reputation: 150735

The problem is that the expression is evaluated from left to right. That is,

df.col1 | df.col2 | df.col3  is evaluated as  (df.col1 | df.col2) | df.col3

Now, I think it is an implementation choice in pandas that None | True evaluates to False. So in this case (df.col1 | df.col2) is all False, and the whole expression reduces to df.col3. That's why you only see the first two rows.

To fix this, use

df[df.any(axis=1)]
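
If you would rather keep the explicit | chain, one option is to cast the columns to a real boolean dtype first so that the result no longer depends on operand order. This is only a minimal sketch: the names bool_df, mask and result are illustrative, and it assumes col1 holds None (as in the question) rather than NaN, since NaN would cast to True.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [None, None, None],
    "col2": [True, False, True],
    "col3": [True, True, False],
})

# Cast every column to bool up front; None in an object column becomes False.
bool_df = df.astype(bool)

# With plain boolean columns, | is an ordinary element-wise OR,
# so the order of the operands no longer matters.
mask = bool_df["col1"] | bool_df["col2"] | bool_df["col3"]

# Select from the original frame so the None values are preserved.
result = df[mask]
print(result)
```

Every row has at least one True in col2 or col3, so the mask selects all three rows here.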

Upvotes: 3
