LJo

Reputation: 61

Pandas - Calculate average of columns with condition based on values in other columns

I am struggling to create a new column in my data frame that holds, for each row, an average calculated under certain conditions. This is what the data frame looks like:

ID, 1_a, 1_b, 1_c, 2_a, 2_b, 2_c, 3_a, 3_b, 3_c 
0, 0, 145, 0.8, 0, 555, 0.7, 1, 335, 0.7
1, 1, 222, 0.9, 1, 224, 0.4, 1, 555, 0.6
3, 1, 111, 0.3, 0, 222, 0.5, 1, 999, 0.7 

I hope to have the following result:

ID, 1_a, 1_b, 1_c, 2_a, 2_b, 2_c, 3_a, 3_b, 3_c, NEW
0, 0, 145, 0.8, 0, 555, 0.7, 1, 335, 0.7, 0.7
1, 1, 222, 0.9, 1, 224, 0.4, 1, 555, 0.6, 0.6
3, 1, 111, 0.3, 0, 222, 0.5, 1, 999, 0.7, 0.5

The logic is the following.

If 1_a is 1, keep the value in 1_c; otherwise ignore it.
If 2_a is 1, keep the value in 2_c; otherwise ignore it.
If 3_a is 1, keep the value in 3_c; otherwise ignore it.

Then calculate the average of the kept values for each row and store it in the column 'NEW'.

I tried several ways, but they only work if the data frame has a single row. With more than one row, the mean seems to be calculated across the whole data frame. Additionally, I would like to optimise the function, as I have more than 10 of these IF conditions. This is what I tried, but it does not give me the result I am looking for:

def test(x):
    a = x[x['1_a'] == 1]['1_c'].values
    b = x[x['2_a'] == 1]['2_c'].values
    c = x[x['3_a'] == 1]['3_c'].values
    xx = np.concatenate((a, b, c), axis=0)
    z = sum(xx) / len(xx)
    x['New_Prob'] = z
    return x

print(test(df))

The result is something like this:

ID, 1_a, 1_b, 1_c, 2_a, 2_b, 2_c, 3_a, 3_b, 3_c, NEW
0, 0, 145, 0.8, 0, 555, 0.7, 1, 335, 0.7, 0.6
1, 1, 222, 0.9, 1, 224, 0.4, 1, 555, 0.6, 0.6
3, 1, 111, 0.3, 0, 222, 0.5, 1, 999, 0.7, 0.6

Upvotes: 1

Views: 5217

Answers (3)

AlexK

Reputation: 3011

Here is a solution that does not overwrite any existing values in the dataframe, but it assumes that the "_a" columns contain only 1s and 0s.

The problem essentially becomes a weighted average of the "_c" columns, weighted by the "_a" columns: a flag of 0 removes the corresponding "_c" value from the sum, and the sum of the flags is the number of values that were kept.

# Remove stray spaces from the column names
df.columns = df.columns.str.strip()

a_cols = [col for col in df.columns if col.endswith('_a')]
c_cols = [col for col in df.columns if col.endswith('_c')]
# Create a list of tuples of column names, i.e., [('1_a', '1_c'), ('2_a', '2_c'), ('3_a', '3_c')].
# This relies on the "_a" and "_c" columns appearing in the same order.
a_c = list(zip(a_cols, c_cols))  # if using Python 2.x, use zip(a_cols, c_cols)

# Sum of the flagged "_c" values divided by the number of flags set in each row
df['NEW'] = sum([df[x] * df[y] for x, y in a_c]) / sum([df[z] for z in a_cols])
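As a quick check of the weighted-average idea, here is a minimal, self-contained sketch that rebuilds the sample data from the question. Pairing the columns by their shared prefix (the `prefixes` list) is my own assumption to avoid depending on column order; it is not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "ID": [0, 1, 3],
    "1_a": [0, 1, 1], "1_b": [145, 222, 111], "1_c": [0.8, 0.9, 0.3],
    "2_a": [0, 1, 0], "2_b": [555, 224, 222], "2_c": [0.7, 0.4, 0.5],
    "3_a": [1, 1, 1], "3_b": [335, 555, 999], "3_c": [0.7, 0.6, 0.7],
})

# Hypothetical helper list: "1", "2", "3", taken from the "_a" column names
prefixes = [col[:-2] for col in df.columns if col.endswith("_a")]

# Numerator: "_c" values multiplied by their 0/1 flag; denominator: number of flags set
weighted_sum = sum(df[p + "_a"] * df[p + "_c"] for p in prefixes)
flag_count = sum(df[p + "_a"] for p in prefixes)

# A row where every flag is 0 would give 0/0, which pandas reports as NaN
df["NEW"] = weighted_sum / flag_count

print(df[["ID", "NEW"]].round(1))
```

For the three sample rows this should reproduce the values the question asks for once rounded to one decimal place.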

Upvotes: 0

smitsy

Reputation: 81

If the numeric prefixes of your '_a' and '_c' columns cover the same range (here 1 to 3), you can simply loop through them:

r = range(1,4)
for i in r:
    df.loc[df["{}_a".format(i)] != 1, "{}_c".format(i)] = np.nan  # np.NaN was removed in NumPy 2.0

df['NEW'] = df[['{}_c'.format(i) for i in r]].mean(axis=1)
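Note that the loop above overwrites the original "_c" values with NaN. If those values are still needed, one possible variant (a sketch, not the answerer's code) builds the mask once and passes it to `DataFrame.where`, which leaves the source columns untouched. The `prefixes`, `a_cols`, `c_cols` and `keep` names are my own, and the sample data is copied from the question.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "ID": [0, 1, 3],
    "1_a": [0, 1, 1], "1_b": [145, 222, 111], "1_c": [0.8, 0.9, 0.3],
    "2_a": [0, 1, 0], "2_b": [555, 224, 222], "2_c": [0.7, 0.4, 0.5],
    "3_a": [1, 1, 1], "3_b": [335, 555, 999], "3_c": [0.7, 0.6, 0.7],
})

# Build matching "_a"/"_c" column lists from the shared prefixes
prefixes = [col[:-2] for col in df.columns if col.endswith("_a")]
a_cols = [p + "_a" for p in prefixes]
c_cols = [p + "_c" for p in prefixes]

# Boolean array, matched by position: True where the corresponding "_a" flag is 1
keep = df[a_cols].eq(1).to_numpy()

# where() keeps values where the mask is True and yields NaN elsewhere;
# mean(axis=1) skips NaN by default, so only the kept values are averaged
df["NEW"] = df[c_cols].where(keep).mean(axis=1)

print(df[["ID", "NEW"]].round(1))
```

Because the mask is derived from the column names, this should also scale to ten or more column groups without hard-coding the range.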

Upvotes: 1

Erfan

Reputation: 42926

You can simplify this by writing a for loop that goes through each pair of columns with the suffixes _a and _c and conditionally replaces values with NaN using np.where.

Finally, you create the NEW column using pandas.DataFrame.mean, which skips NaN values by default.

# First we remove the spaces from your column names
df.columns = df.columns.str.strip()

amt_cols_check = 3

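for x in range(amt_cols_check):
    # Keep the _c value where the matching _a flag is 1, otherwise set NaN
    # (np.nan instead of np.NaN, which was removed in NumPy 2.0)
    df[f'{x+1}_c'] = np.where(df[f'{x+1}_a'] == 1, df[f'{x+1}_c'], np.nan)

cols_mean = [col for col in df.columns if col.endswith('_c')]
df['NEW'] = df[cols_mean].mean(axis=1).round(1)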

print(df)
   ID  1_a  1_b  1_c  2_a  2_b  2_c  3_a  3_b  3_c  NEW
0   0    0  145  NaN    0  555  NaN    1  335  0.7  0.7
1   1    1  222  0.9    1  224  0.4    1  555  0.6  0.6
2   3    1  111  0.3    0  222  NaN    1  999  0.7  0.5

Note, as AlexK pointed out in the comments: I use f-strings in my answer, which are only supported in Python 3.6 and later.

Upvotes: 1
