Reputation: 1572
I'm getting some strange behaviour in pandas when using Boolean indexing, and I don't understand what's going wrong.
With a DataFrame data that contains a column RSTAR of float values, among others, I'm getting the following when I try to do boolean indexing:
rejection_list = list( data[ (data.RSTAR == 0) | (~ np.isfinite(data.RSTAR)) ].loc[:,'NAME'] )
Gives me an error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
The following on the other hand:
booll = (data.RSTAR == 0) | (~ np.isfinite(data.RSTAR))
rejection_list2 = list(data[booll].loc[:,'NAME'])
Works fine. As far as I can tell, these two expressions should do the exact same thing. So why does the bottom one work, but not the top one?
UPDATE: I still don't understand what's going on. I looked further into it and here's what happened:
I tried to slice the data DataFrame so that I could post it here. With data = data.loc[:5,:] I get the exact same error. However, with data = data.loc[:5, ['RSTAR', 'NAME']] I get no error and it works as it should.
I'm not sure how to post the entire data DataFrame here since it has lots of columns, but the column names are:
data.columns
Index(['Unnamed: 0', 'NAME', 'RADIUS', 'RUPPER', 'RLOWER', 'UR', 'MASS',
'MASSUPPER', 'MASSLOWER', 'UMASS', 'A', 'AUPPER', 'ALOWER', 'UA',
'RSTAR', 'RSTARUPPER', 'RSTARLOWER', 'URSTAR', 'TEFF', 'TEFFUPPER',
'TEFFLOWER', 'UTEFF', 'ECC', 'LUM', 'RERRMAX', 'LOG_FLUX', 'FLUX'],
dtype='object')
So I can't see any duplication or anything. I just don't understand what's wrong.
UPDATE 2: It got more confusing. So I went into pdb again, like so:
pdb.set_trace() ###
rejection_list = list(data[ (data.RSTAR == 0) | (~ np.isfinite(data.RSTAR)) ].loc[:,'NAME'])
Keeping the same data, I copied and pasted the exact statement above, rejection_list = list(data[ (data.RSTAR == 0) | (~ np.isfinite(data.RSTAR)) ].loc[:,'NAME']), and it worked while in pdb mode. However, as soon as I press c to continue out of pdb into the next line, the same line I just successfully executed in pdb, it gives me the error again. I'm at a complete loss here. Is it something to do with a cache? I opened a new Terminal but it's still giving me the same problem.
UPDATE 3: Tried it with isnull() and notnull() and same problem.
booll = (data.RSTAR==0) | (data.RSTAR.isnull())
data[booll]
works, but the following doesn't:
rejection_list = list(data[ (data.RSTAR == 0) | (data.RSTAR.isnull()) ].loc[:,'NAME'])
UPDATE 4: The opposite works with no problem: data = data[(data.RSTAR != 0) & (data.RSTAR.notnull())]
EDIT: To make it clear, it seems to be the case that when I execute the command by typing it in directly in pdb, it works, for the small and large dataframes. However, when I just let the script run, then it doesn't work for small or large.
Upvotes: 2
Views: 1575
Reputation: 862651
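I think you can use a one-line solution with .loc and the pandas function isnull (isnull flags NaN only, so keep ~ np.isfinite if infinite values must be rejected too):
rejection_list = data.loc[(data.RSTAR == 0) | (data.RSTAR.isnull()) , 'NAME'].tolist()
On pandas older than 1.0 the same selection also works with .ix, which has since been removed:
rejection_list = data.ix[(data.RSTAR == 0) | (data.RSTAR.isnull()) , 'NAME'].tolist()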
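When choosing the mask, it may help to see how the pandas null checks and np.isfinite treat infinite values. This is only a minimal sketch on made-up values; the Series name rstar is illustrative and not taken from your data:

```python
import numpy as np
import pandas as pd

# Made-up values for illustration: zero, a normal value, -inf, NaN, +inf.
rstar = pd.Series([0.0, 2.0, -np.inf, np.nan, np.inf])

# isnull() marks missing values; with default settings that is only NaN.
null_mask = rstar.isnull()

# ~np.isfinite marks everything that is not a finite number: NaN and +/-inf.
nonfinite_mask = ~np.isfinite(rstar)

print(null_mask.tolist())       # [False, False, False, True, False]
print(nonfinite_mask.tolist())  # [False, False, True, True, True]
```

So a rejection list built from a null check alone would keep the rows with infinite RSTAR, while one built from ~np.isfinite would drop them.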
I tried to reproduce your error, but everything works correctly:
import pandas as pd
import numpy as np
data = pd.DataFrame({'RSTAR':[0,2,-np.inf, np.nan,np.inf],
'NAME':[4,5,6,7,10]})
print (data)
   NAME     RSTAR
0     4  0.000000
1     5  2.000000
2     6      -inf
3     7       NaN
4    10       inf
rejection_list = list( data[ (data.RSTAR == 0) | (~ np.isfinite(data.RSTAR)) ].loc[:,'NAME'])
print (rejection_list)
[4, 6, 7, 10]
booll = (data.RSTAR == 0) | (~ np.isfinite(data.RSTAR))
rejection_list2 = list(data[booll].loc[:,'NAME'])
print (rejection_list2)
[4, 6, 7, 10]
# isnull() flags only NaN, so the rows with -inf and inf are not rejected here
rejection_list3 = data.loc[(data.RSTAR == 0) | (data.RSTAR.isnull()) , 'NAME'].tolist()
print (rejection_list3)
[4, 7]
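I can't say this is what happens in your script, but in general this ValueError is raised when a whole Series is evaluated as a single True/False, for example with or, and, not or if. It does not come from the element-wise operators |, & and ~. A minimal sketch on the same sample data that triggers the message on purpose, so you can compare it with the line that actually fails in your script (the names ok, bad and message are only illustrative):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'RSTAR': [0, 2, -np.inf, np.nan, np.inf],
                     'NAME': [4, 5, 6, 7, 10]})

# Element-wise operators build a boolean mask, so this works.
mask = (data.RSTAR == 0) | (~np.isfinite(data.RSTAR))
ok = data.loc[mask, 'NAME'].tolist()

# Python's `or` asks the left-hand Series for one True/False value,
# which pandas refuses to give, so this raises the ValueError.
try:
    bad = (data.RSTAR == 0) or (~np.isfinite(data.RSTAR))
    message = None
except ValueError as exc:
    message = str(exc)

print(ok)       # [4, 6, 7, 10]
print(message)  # The truth value of a Series is ambiguous. ...
```

If the traceback from the script run points at a different line than the one you pasted into pdb, a condition like this elsewhere in the script would be one thing to look for.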
Upvotes: 1