cs_stackX

Reputation: 1527

Yet another Pandas SettingWithCopyWarning question

Yes this question has been asked many times! No, I have still not been able to figure out how to run this boolean filter without generating the Pandas SettingWithCopyWarning warning.

for x in range(len(df_A)):
    df_C = df_A.loc[(df_A['age'] >= df_B['age_limits'].iloc[x][0]) &
                    (df_A['age'] <= df_B['age_limits'].iloc[x][1])]

    df_D['count'].iloc[x] = len(df_C) # triggers warning

I've tried:

I know I can suppress the warning, but I don't want to do that.

What am I missing? I know it's probably something obvious.

Many thanks!

Upvotes: 0

Views: 52

Answers (1)

Ben.T

Reputation: 29635

For more details on why you got the SettingWithCopyWarning, I suggest reading this answer. It is mostly because selecting the column df_D['count'] and then applying iloc[x] is a "chained assignment", which pandas flags this way.

To prevent it, get the position of the column you want in df_D and then use iloc for both the row and the column in the for loop:

pos_col_D = df_D.columns.get_loc('count')  # get_loc is a method, so call it with parentheses
for x in range(len(df_A)):
    df_C = df_A.loc[(df_A['age'] >= df_B['age_limits'].iloc[x][0]) &
                    (df_A['age'] <= df_B['age_limits'].iloc[x][1])]

    df_D.iloc[x,pos_col_D ] = len(df_C) #no more warning
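
As a minimal sketch of the same idea outside the loop, the snippet below writes to single cells in one indexing step, with no intermediate `df_D['count']` selection. The `df_D` here is a hypothetical stand-in (the question does not show how it is built), and the `.loc` and `.iat` forms are alternatives to the `iloc` form used above, not something the question requires.

```python
import pandas as pd

# Hypothetical df_D: the question does not show its real contents.
df_D = pd.DataFrame({'count': [0, 0, 0]})
pos_col_D = df_D.columns.get_loc('count')

# Chained form from the question (selection, then assignment) may warn:
# df_D['count'].iloc[0] = 5

# Single-step alternatives: row and column are given in one indexer.
df_D.iloc[0, pos_col_D] = 5               # positional row and column
df_D.loc[df_D.index[1], 'count'] = 7      # row label and column label
df_D.iat[2, pos_col_D] = 9                # fast scalar access by position

print(df_D['count'].tolist())
```

Each of these addresses the cell through `df_D` itself, so pandas knows the write targets the original frame and not a temporary copy.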

Also, because you compare all the values of df_A.age with the bounds in df_B.age_limits, I think you could speed up your code using numpy.ufunc.outer, with the ufuncs being greater_equal and less_equal, and then summing over axis=0.

#Setup
import numpy as np
import pandas as pd
df_A = pd.DataFrame({'age': [12,25,32]})
df_B = pd.DataFrame({'age_limits':[[3,99], [20,45], [15,30]]})

#your result
for x in range(len(df_A)):
    df_C = df_A.loc[(df_A['age'] >= df_B['age_limits'].iloc[x][0]) &
                    (df_A['age'] <= df_B['age_limits'].iloc[x][1])]
    print (len(df_C))
3
2
1

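#with numpy
# .to_numpy() is used because newer pandas versions may reject ufunc.outer on a Series
print ( ( np.greater_equal.outer(df_A.age.to_numpy(), df_B.age_limits.str[0].to_numpy())
         & np.less_equal.outer(df_A.age.to_numpy(), df_B.age_limits.str[1].to_numpy()))
        .sum(0) )
[3 2 1]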

So you can assign the result of the previous line of code directly to df_D['count'] without a for loop. Hope this works for you.
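
A rough sketch of that direct assignment, using the setup data from this answer. The `df_D` frame is a hypothetical placeholder with one row per age range, and the bounds are unpacked into plain NumPy arrays first, which is an assumption made here to keep the `outer` calls independent of the pandas version.

```python
import numpy as np
import pandas as pd

df_A = pd.DataFrame({'age': [12, 25, 32]})
df_B = pd.DataFrame({'age_limits': [[3, 99], [20, 45], [15, 30]]})

# Hypothetical df_D: one row per age range, with a 'count' column to fill.
df_D = pd.DataFrame({'count': [0, 0, 0]})

ages = df_A['age'].to_numpy()
limits = np.array(df_B['age_limits'].tolist())   # shape (n_ranges, 2)
lower, upper = limits[:, 0], limits[:, 1]

# in_range[i, j] is True when ages[i] falls inside range j (bounds inclusive).
in_range = np.greater_equal.outer(ages, lower) & np.less_equal.outer(ages, upper)

# Summing over axis 0 counts, for each range, how many ages fall inside it.
df_D['count'] = in_range.sum(axis=0)

print(df_D['count'].tolist())
```

Because the whole column is assigned at once, there is no chained indexing and nothing for the warning to flag.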

Upvotes: 1
