user3299166

Reputation: 191

Pandas boolean DataFrame selection ambiguity

EDIT: Fixed values in tables.

Let's say I have a pandas dataframe df:

>>> df
          a         b         c
0  0.016367  0.289944 -0.891527
1  1.130206  0.899758 -0.276587
2  1.390528 -1.472802  0.128979
3  0.023598 -0.931329  0.158143
4  1.401183 -0.162357 -0.959156
5 -0.127765  1.142039 -0.734434

So now I try to do some Boolean indexing:

>>> df[df > 0.5]
          a         b         c
0       NaN       NaN       NaN
1  1.130206  0.899758        NaN
2  1.390528       NaN        NaN
3       NaN       NaN        NaN
4  1.401183       NaN        NaN
5       NaN  1.142039        NaN

>>> df[df < 0]
          a         b         c
0       NaN       NaN -0.891527
1       NaN       NaN -0.276587
2       NaN -1.472802       NaN
3       NaN -0.931329       NaN
4       NaN -0.162357 -0.959156
5 -0.127765       NaN -0.734434

So now I try to use the logical OR of those two conditions as the indexing condition:

>>> df[df > 0.5 or df < 0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Ben\Anaconda\lib\site-packages\pandas\core\generic.py", line 692, in __nonzero__
    .format(self.__class__.__name__))
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I've researched this a bit. It's a basic design decision: the NumPy developers decided that the truth value of an array is ambiguous, because it could mean either the any case or the all case. What I don't get is why checking whether the values are > 0.5 is valid and checking whether they are < 0 is valid, but checking whether they are > 0.5 or < 0 is invalid. I've also tried rearranging the boolean syntax, but this error is inescapable. Can someone explain why doing the OR creates an ambiguous case?

Upvotes: 1

Views: 2390

Answers (3)

BrenBarn

Reputation: 251428

It is not possible for custom types to override the behavior of and and or in Python. That is, it is not possible for NumPy to say that it wants [0, 1, 1] and [1, 1, 0] to be [0, 1, 0]. This is because of how the and operation short-circuits (see the documentation). In essence, the short-circuiting behavior of and and or means that these operations must first reduce the left operand to a single truth value and then return one of the two operands unchanged. They cannot combine their two operands in some way that makes use of data in both operands at once (for instance, to compare the elements componentwise, as would be natural for NumPy). That is why df > 0.5 and df < 0 each work on their own (each simply returns a boolean DataFrame), while df > 0.5 or df < 0 fails: the or asks for the truth value of the whole DataFrame df > 0.5, and that is what raises the error.
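
A minimal sketch of this, using a made-up class (Pair is hypothetical, not a NumPy or pandas type): and asks the left operand for a single truth value through __bool__, whereas & is dispatched to __and__, which receives both operands and so can work elementwise.

```python
class Pair:
    """Hypothetical stand-in for an array type; not a NumPy or pandas class."""

    def __init__(self, values):
        self.values = list(values)

    def __bool__(self):
        # `and` / `or` call this on the left operand to decide which operand to return.
        raise ValueError("The truth value of a Pair is ambiguous.")

    def __and__(self, other):
        # `&` is dispatched here with both operands, so it can work elementwise.
        return Pair(a & b for a, b in zip(self.values, other.values))


left, right = Pair([0, 1, 1]), Pair([1, 1, 0])

# Plain lists show what `and` really does: it returns one operand unchanged.
list_result = [0, 1, 1] and [1, 1, 0]   # [1, 1, 0], because a non-empty list is truthy

try:
    left and right                       # Python evaluates bool(left) first
    and_raised = False
except ValueError:
    and_raised = True                    # same failure mode as the DataFrame error

elementwise = (left & right).values      # [0, 1, 0]
```

The same reasoning applies to or and |; only the special method names differ (__or__ instead of __and__).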

The solution is to use the bitwise operators & and |. However, you do have to be careful with this: & and | bind more tightly than comparison operators such as > and <, so each comparison has to be wrapped in its own parentheses.
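
To illustrate the precedence issue, here is a small sketch using only plain integers and the standard ast module (no pandas needed). The expression string is just the question's condition with or swapped for |, without parentheses.

```python
import ast

# | binds tighter than >, so the two lines below mean different things.
without_parens = 2 > 1 | 4      # parsed as 2 > (1 | 4), i.e. 2 > 5
with_parens = (2 > 1) | 4       # True | 4, i.e. 1 | 4

# Ask the parser how it reads the unparenthesized DataFrame condition.
tree = ast.parse("df > 0.5 | df < 0", mode="eval").body

# It is a single chained comparison: df > (0.5 | df) < 0
is_chained = isinstance(tree, ast.Compare) and len(tree.ops) == 2
middle = tree.comparators[0]
middle_is_bitor = isinstance(middle, ast.BinOp) and isinstance(middle.op, ast.BitOr)
```

So without parentheses the | is applied to 0.5 and df first, which is not the intended condition.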

Upvotes: 3

shx2

Reputation: 64328

Since the logical operators are not overridable in Python, NumPy and pandas override the bitwise operators instead.

This means you need to use the bitwise-or operator:

df[(df > 0.5) | (df < 0)]
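
As a rough check against the question's data, the sketch below rebuilds df from the values posted in the question (the variable names mask, result, same_as_where and nan_count are mine) and applies the combined condition.

```python
import pandas as pd

# Values copied from the question's df.
df = pd.DataFrame({
    "a": [0.016367, 1.130206, 1.390528, 0.023598, 1.401183, -0.127765],
    "b": [0.289944, 0.899758, -1.472802, -0.931329, -0.162357, 1.142039],
    "c": [-0.891527, -0.276587, 0.128979, 0.158143, -0.959156, -0.734434],
})

# Each comparison yields a boolean DataFrame; | combines them elementwise.
mask = (df > 0.5) | (df < 0)
result = df[mask]            # entries where mask is False become NaN

# Indexing with a boolean DataFrame behaves like DataFrame.where.
same_as_where = result.equals(df.where(mask))
nan_count = int(result.isna().sum().sum())
```

Only the values between 0 and 0.5 are masked out, so result keeps every entry that appears in either of the two single-condition outputs from the question.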

Upvotes: 1

EdChum

Reputation: 394159

You need to use the bitwise or and put the conditions in parentheses:

df[(df > 0.5) | (df < 0)]

The reason is that the truth value of an array is ambiguous: when only some of the values satisfy the condition, it is unclear whether the array as a whole should count as True (at least one element matches) or False (not every element matches).

If you reduced the condition with the any() method (for a DataFrame, .any().any() to get a single value), it would evaluate to True.
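
A small sketch of the any/all distinction, using two rows taken from the question's data (the variable names are mine). It assumes a pandas version where DataFrame truth-testing raises ValueError, as in the traceback above.

```python
import pandas as pd

df = pd.DataFrame({"a": [0.016367, 1.130206], "b": [0.289944, 0.899758]})
mask = df > 0.5                      # row 0 is all False, row 1 is all True

any_true = bool(mask.any().any())    # at least one element satisfies the condition
all_true = bool(mask.all().all())    # every element satisfies the condition

try:
    bool(mask)                       # what `or` / `and` / `if` do implicitly
    raised = False
except ValueError:
    raised = True                    # pandas refuses to pick any() or all() for you
```

Here any_true and all_true disagree, which is exactly the case where a single truth value for the whole frame would be a guess.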

The parentheses are required because of operator precedence.

Example:

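In [23]:

import numpy as np
import pandas as pd

# 5x5 frame of standard-normal samples (values differ on each run)
df = pd.DataFrame(np.random.randn(5, 5))
df
Out[23]: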
          0         1         2         3         4
0  0.320165  0.123677 -0.202609  1.225668  0.327576
1 -0.620356  0.126270  1.191855  0.903879  0.214802
2 -0.974635  1.712151  1.178358  0.224962 -0.921045
3 -1.337430 -1.225469  1.150564 -1.618739 -1.297221
4 -0.093164 -0.928846  1.035407  1.766096  1.456888
In [24]:

df[(df > 0.5) | (df < 0)]
Out[24]:
          0         1         2         3         4
0       NaN       NaN -0.202609  1.225668       NaN
1 -0.620356       NaN  1.191855  0.903879       NaN
2 -0.974635  1.712151  1.178358       NaN -0.921045
3 -1.337430 -1.225469  1.150564 -1.618739 -1.297221
4 -0.093164 -0.928846  1.035407  1.766096  1.456888

Upvotes: 0
