Luke Nisbet
Luke Nisbet

Reputation: 23

Looping IF statements for each row in a dataframe in pandas

Hi I am new to using pandas coming from a SAS background and I am trying to segment a continuous variable into bands using the following code.

var_range = df['BILL_AMT1'].max() - df['BILL_AMT1'].min()
a= 10
for i in range(1,a):
    inc = var_range/a
    lower_bound = df['BILL_AMT1'].min() + (i-1)*inc
    print('Lower bound is '+str(lower_bound))
    upper_bound = df['BILL_AMT1'].max() + (i)*inc
    print('Upper bound is '+str(upper_bound))
    if (lower_bound <= df['BILL_AMT1'] < upper_bound):
        df['bill_class'] = i
    i+=1

I am expecting the code to check if the value of df['BILL_AMT1'] is within the current loops boundings and set a df['bill_class'] accordingly.

I get the following error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I think that the if condition is evaluating correctly but the error is due to assigning a new column a value of the for loops counter.

Can anyone explain what is going wrong or suggest an alternative.

Upvotes: 2

Views: 2044

Answers (2)

unutbu
unutbu

Reputation: 879611

To avoid the ValueError, change

if (lower_bound <= df['BILL_AMT1'] < upper_bound):
    df['bill_class'] = i

to

mask = (lower_bound <= df['BILL_AMT1']) & (df['BILL_AMT1'] < upper_bound)
df.loc[mask, 'bill_class'] = i

The chained comparison (lower_bound <= df['BILL_AMT1'] < upper_bound) is equivalent to

(lower_bound <= df['BILL_AMT1']) and (df['BILL_AMT1'] < upper_bound)

The and operator causes the two boolean Series (lower_bound <= df['BILL_AMT1']), (df['BILL_AMT1'] < upper_bound) to be evaluated in a boolean context -- i.e. reduced to a single boolean value. Pandas refuses to reduce Series to a single boolean value.

Instead, to return a boolean Series, use the & operator instead of and:

mask = (lower_bound <= df['BILL_AMT1']) & (df['BILL_AMT1'] < upper_bound)

and then to assign values to the bill_class column where mask is True, use df.loc:

df.loc[mask, 'bill_class'] = i

To bin the data in df['BILL_AMT1'], you could remove the Python for-loop entirely, and as DSM suggests, use pd.cut:

df['bill_class'] = pd.cut(df['BILL_AMT1'], bins=10, labels=False)+1

Upvotes: 2

piRSquared
piRSquared

Reputation: 294278

IIUC, this should be the fix to your code:

mx, mn = df['BILL_AMT1'].max(), df['BILL_AMT1'].min()
rng = mx - mn
a = 10

for i in range(a):
    inc = rng / a
    lower_bound = mn + i * inc
    print('Lower bound is ' + str(lower_bound))
    upper_bound = mn + (i + 1) * inc if i + 1 < a else mx
    print('Upper bound is ' + str(upper_bound))
    ge = df['BILL_AMT1'].ge(lower_bound)
    lt = df['BILL_AMT1'].lt(upper_bound)
    df.loc[ge & lt, 'bill_class'] = i

however
I'd do this

df['bill_class'] = pd.qcut(df['BILL_AMT1'], 10, list(range(10)))

Upvotes: 0

Related Questions