Reputation: 61
I am struggling to create a new column in my data frame: for each row, it should hold an average calculated under certain conditions. This is what the data frame looks like:
ID, 1_a, 1_b, 1_c, 2_a, 2_b, 2_c, 3_a, 3_b, 3_c
0, 0, 145, 0.8, 0, 555, 0.7, 1, 335, 0.7
1, 1, 222, 0.9, 1, 224, 0.4, 1, 555, 0.6
3, 1, 111, 0.3, 0, 222, 0.5, 1, 999, 0.7
I hope to have the following result:
ID, 1_a, 1_b, 1_c, 2_a, 2_b, 2_c, 3_a, 3_b, 3_c, NEW
0, 0, 145, 0.8, 0, 555, 0.7, 1, 335, 0.7, 0.7
1, 1, 222, 0.9, 1, 224, 0.4, 1, 555, 0.6, 0.6
3, 1, 111, 0.3, 0, 222, 0.5, 1, 999, 0.7, 0.5
The logic is the following.
If 1_a is 1, keep the value in 1_c; otherwise ignore it
If 2_a is 1, keep the value in 2_c; otherwise ignore it
If 3_a is 1, keep the value in 3_c; otherwise ignore it
Then calculate the average of the kept values for each row and store it in column 'NEW'.
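Spelled out row by row, the intended logic looks like this (a minimal sketch; the sample data is reconstructed below with the _b columns omitted, since they play no part in the calculation):

```python
import pandas as pd

# Sample data from the question (the unused _b columns are omitted)
df = pd.DataFrame({
    'ID':  [0, 1, 3],
    '1_a': [0, 1, 1], '1_c': [0.8, 0.9, 0.3],
    '2_a': [0, 1, 0], '2_c': [0.7, 0.4, 0.5],
    '3_a': [1, 1, 1], '3_c': [0.7, 0.6, 0.7],
})

# For each row, average only the i_c values whose i_a flag is 1
def row_avg(row):
    kept = [row[f'{i}_c'] for i in (1, 2, 3) if row[f'{i}_a'] == 1]
    return sum(kept) / len(kept)

df['NEW'] = df.apply(row_avg, axis=1).round(1)
print(df['NEW'].tolist())  # [0.7, 0.6, 0.5]
```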
I tried several ways, but it only works if the data frame has a single row. With more than one row, it seems to calculate the mean across the whole data frame. I would also like to keep the function compact, as I have more than 10 of these IF conditions. This is what I tried, but it does not give me the result I am looking for:
import numpy as np

def test(x):
    a = x[x['1_a'] == 1]['1_c'].values
    b = x[x['2_a'] == 1]['2_c'].values
    c = x[x['3_a'] == 1]['3_c'].values
    xx = np.concatenate((a, b, c), axis=0)
    z = sum(xx) / len(xx)
    x['New_Prob'] = z
    return x

print(test(df))
The result is something like this:
ID, 1_a, 1_b, 1_c, 2_a, 2_b, 2_c, 3_a, 3_b, 3_c, NEW
0, 0, 145, 0.8, 0, 555, 0.7, 1, 335, 0.7, 0.6
1, 1, 222, 0.9, 1, 224, 0.4, 1, 555, 0.6, 0.6
3, 1, 111, 0.3, 0, 222, 0.5, 1, 999, 0.7, 0.6
Upvotes: 1
Views: 5217
Reputation: 3011
Here is a solution that does not require replacing existing values in the dataframe with anything, but assumes that there are only 1's and 0's in the "_a" columns.
This problem essentially becomes a weighted average of "_c" columns, weighted by "_a" columns.
df.columns = df.columns.str.strip()
a_cols = [col for col in df.columns if col.endswith('_a')]
c_cols = [col for col in df.columns if col.endswith('_c')]
#create a list of tuples of column names, i.e., [('1_a', '1_c'), ('2_a', '2_c'), ('3_a', '3_c')]
a_c = list(zip(a_cols, c_cols))  # list() is only needed on Python 3, where zip returns an iterator
df['NEW'] = sum([df[x] * df[y] for x,y in a_c]) / sum([df[z] for z in a_cols])
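As a quick sanity check, here is that weighted-average line applied to the sample data from the question (reconstructed here with the unused _b columns left out):

```python
import pandas as pd

# Sample data from the question (the unused _b columns are omitted)
df = pd.DataFrame({
    '1_a': [0, 1, 1], '1_c': [0.8, 0.9, 0.3],
    '2_a': [0, 1, 0], '2_c': [0.7, 0.4, 0.5],
    '3_a': [1, 1, 1], '3_c': [0.7, 0.6, 0.7],
})

a_cols = [col for col in df.columns if col.endswith('_a')]
c_cols = [col for col in df.columns if col.endswith('_c')]
a_c = list(zip(a_cols, c_cols))

# Weighted average: each _c value only counts when its _a flag is 1
df['NEW'] = sum(df[x] * df[y] for x, y in a_c) / sum(df[z] for z in a_cols)
print(df['NEW'].round(1).tolist())  # [0.7, 0.6, 0.5]
```

One caveat: a row whose _a flags are all 0 divides by zero and comes out as NaN, so that case needs separate handling if it can occur in your data.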
Upvotes: 0
Reputation: 81
If your column names cover the same numeric range for both '_a' and '_c', you can simply loop through them:
import numpy as np

r = range(1, 4)
for i in r:
    df.loc[df["{}_a".format(i)] != 1, "{}_c".format(i)] = np.nan
df['NEW'] = df[['{}_c'.format(i) for i in r]].mean(axis=1)
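For example, on the sample data from the question (reconstructed here without the unused _b columns), the loop produces:

```python
import numpy as np
import pandas as pd

# Sample data from the question (the unused _b columns are omitted)
df = pd.DataFrame({
    '1_a': [0, 1, 1], '1_c': [0.8, 0.9, 0.3],
    '2_a': [0, 1, 0], '2_c': [0.7, 0.4, 0.5],
    '3_a': [1, 1, 1], '3_c': [0.7, 0.6, 0.7],
})

r = range(1, 4)
for i in r:
    # Blank out each _c value whose matching _a flag is not 1
    df.loc[df["{}_a".format(i)] != 1, "{}_c".format(i)] = np.nan

# mean(axis=1) skips NaN values by default
df['NEW'] = df[['{}_c'.format(i) for i in r]].mean(axis=1)
print(df['NEW'].round(1).tolist())  # [0.7, 0.6, 0.5]
```

Be aware that this overwrites the original _c values in place; work on a copy if you still need them.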
Upvotes: 1
Reputation: 42926
You can simplify this by writing a for loop which goes through each column with suffix _c and _a and conditionally replaces values with NaN using np.where. In the end you create the NEW column using pandas.DataFrame.mean.
import numpy as np

# First we remove the spaces from your column names
df.columns = df.columns.str.strip()

amt_cols_check = 3
for x in range(amt_cols_check):
    df[f'{x+1}_c'] = np.where(df[f'{x+1}_a'] == 1, df[f'{x+1}_c'], np.nan)

cols_mean = [col for col in df.columns if col.endswith('_c')]
df['NEW'] = round(df[cols_mean].mean(axis=1), 1)
print(df)
   ID  1_a  1_b  1_c  2_a  2_b  2_c  3_a  3_b  3_c  NEW
0   0    0  145  NaN    0  555  NaN    1  335  0.7  0.7
1   1    1  222  0.9    1  224  0.4    1  555  0.6  0.6
2   3    1  111  0.3    0  222  NaN    1  999  0.7  0.5
Note: as AlexK pointed out in the comments, I use f-strings in my answer, which are only supported in Python 3.6 and later.
Upvotes: 1