Marses

Reputation: 1572

Boolean Indexing with Pandas isn't working for me

I'm getting some strange behaviour in pandas when using Boolean indexing, and I don't understand what's going wrong.

With a DataFrame data that contains a column RSTAR of Float values, among others, I'm getting the following when I try to do boolean indexing:

rejection_list = list( data[ (data.RSTAR == 0) | (~ np.isfinite(data.RSTAR)) ].loc[:,'NAME'] )

Gives me an error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

The following on the other hand:

booll = (data.RSTAR == 0) | (~ np.isfinite(data.RSTAR))
rejection_list2 = list(data[booll].loc[:,'NAME'])

Works fine. As far as I can tell, these two expressions should do the exact same thing. So why does the bottom one work, but not the top one?


UPDATE: I still don't understand what's going on. I looked further into it and here's what happened:

I tried to slice the data DataFrame so that I could post it on here. So with data = data.loc[:5,:] I get the same exact error. However, with data = data.loc[:5, ['RSTAR', 'NAME']] I get no error and it works as it should.

I'm not sure how to post the entire data array here since it's got lots of columns, but the column names are:

data.columns
Index(['Unnamed: 0', 'NAME', 'RADIUS', 'RUPPER', 'RLOWER', 'UR', 'MASS',
       'MASSUPPER', 'MASSLOWER', 'UMASS', 'A', 'AUPPER', 'ALOWER', 'UA',
       'RSTAR', 'RSTARUPPER', 'RSTARLOWER', 'URSTAR', 'TEFF', 'TEFFUPPER',
       'TEFFLOWER', 'UTEFF', 'ECC', 'LUM', 'RERRMAX', 'LOG_FLUX', 'FLUX'],
      dtype='object')

So I can't see any duplication or anything. I just don't understand what's wrong.


UPDATE 2: It got more confusing. So I went into pdb again, like so:

pdb.set_trace() ###
rejection_list = list(data[ (data.RSTAR == 0) | (~ np.isfinite(data.RSTAR)) ].loc[:,'NAME'])

Keeping the same data, I copied and pasted the exact statement above, rejection_list = list(data[ (data.RSTAR == 0) | (~ np.isfinite(data.RSTAR)) ].loc[:,'NAME']), into pdb, and it worked there. However, as soon as I press c to continue out of pdb, the script runs that same line, the one I had just executed successfully in pdb, and it gives me the error again. I'm at a complete loss here. Is it something to do with a cache? I opened a new Terminal but it's still giving me the same problem.


UPDATE 3: I tried it with isnull() and notnull() and got the same problem.

booll = (data.RSTAR==0) | (data.RSTAR.isnull())
data[booll]

works, but the following doesn't:

rejection_list = list(data[ (data.RSTAR == 0) | (data.RSTAR.isnull()) ].loc[:,'NAME'])

UPDATE 4: The opposite works with no problem: data = data[(data.RSTAR != 0) & (data.RSTAR.notnull())].


EDIT: To make it clear, it seems to be the case that when I execute the command by typing it in directly in pdb, it works, for the small and large dataframes. However, when I just let the script run, then it doesn't work for small or large.

Upvotes: 2

Views: 1575

Answers (1)

jezrael

Reputation: 862651

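I think you can use a one-line solution with loc and tolist. To match your condition, the mask has to use ~np.isfinite (isnull alone would miss inf and -inf, and notnull would select the rows you want to keep rather than reject):

rejection_list = data.loc[(data.RSTAR == 0) | (~np.isfinite(data.RSTAR)), 'NAME'].tolist()

In older pandas versions the same selection could also be written with ix, but ix was deprecated in pandas 0.20 and removed in 1.0, so loc is the safer choice.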
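As a side note on choosing the mask: the pandas null checks and np.isfinite do not treat infinities the same way, so they can select different rows. A minimal sketch, assuming a small made-up Series s (hypothetical values, not your data) and the default pandas options:

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, not taken from the question's data
s = pd.Series([0.0, 2.0, -np.inf, np.nan, np.inf])

nan_only = s.isnull()          # True only for NaN
non_finite = ~np.isfinite(s)   # True for NaN, inf and -inf

print(nan_only.tolist())       # [False, False, False, True, False]
print(non_finite.tolist())     # [False, False, True, True, True]
```

So if infinite RSTAR values should be rejected as well, ~np.isfinite is the mask to use; isnull is enough only when NaN is the sole kind of bad value.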

I tried to reproduce your error, but everything works correctly:

import pandas as pd
import numpy as np

data = pd.DataFrame({'RSTAR':[0,2,-np.inf, np.nan,np.inf],
                     'NAME':[4,5,6,7,10]})

print (data)
   NAME     RSTAR
0     4  0.000000
1     5  2.000000
2     6      -inf
3     7       NaN
4    10       inf

rejection_list = list( data[ (data.RSTAR == 0) | (~ np.isfinite(data.RSTAR)) ].loc[:,'NAME'])
print (rejection_list)
[4, 6, 7, 10]

booll = (data.RSTAR == 0) | (~ np.isfinite(data.RSTAR))
rejection_list2 = list(data[booll].loc[:,'NAME'])
print (rejection_list2)
[4, 6, 7, 10]

rejection_list3 = data.loc[(data.RSTAR == 0) | (~np.isfinite(data.RSTAR)), 'NAME'].tolist()
print (rejection_list3)
[4, 6, 7, 10]
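Since I can't reproduce the problem, the error message itself may be the best clue. In general, pandas raises it whenever a whole Series is evaluated in a boolean context, for example with or, and, not or if, instead of the element-wise operators |, & and ~. Whether something like that happens in your script is only an assumption; I can't verify it from the post. A minimal sketch on the same sample data as above:

```python
import numpy as np
import pandas as pd

# Same small sample frame as in the reproduction attempt
data = pd.DataFrame({'RSTAR': [0, 2, -np.inf, np.nan, np.inf],
                     'NAME': [4, 5, 6, 7, 10]})

# Element-wise OR: builds a boolean mask with one value per row
mask = (data.RSTAR == 0) | (~np.isfinite(data.RSTAR))

# Python's "or" asks for a single truth value of the left-hand Series,
# which pandas refuses to guess, so it raises ValueError
try:
    bad = (data.RSTAR == 0) or (~np.isfinite(data.RSTAR))
    message = None
except ValueError as exc:
    message = str(exc)

print(mask.tolist())
print(message)
```

If the line that fails in the script differs even slightly from the one typed into pdb (an or in place of |, or a name such as np or data rebound to something else), that would be one possible explanation worth checking.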

Upvotes: 1
