Reputation: 15
I have a dataframe like below
Class| Student| V1| V2| V3| wb
A| Max| 10| 12| 14| 1
A| Ann| 9| 6| 7| 0.9
B| Tom| 6| 7| 10| 0.3
B| Dick| 3| 8| 7| 0.7
C| Dibs| 5| 2| 3| 0.8
C| Mock| 6| 4| 3| 0.6
D| Sunny| 3| 4| 5| 0.9
D| Lock| 8| 3| 6| 1
And i want to calculate the Weighted Mean for V1,V2,V3 grouped by Class the result should be something like below
Class V1_M V2_M V3_M
A 9 8 3
B 5 3 3
C 4 4 3
So far i can separate data frame for each column. But i feel very inefficient
And here is code for 1 variable
import pandas as pd
import numpy as np
def wtdavg(frame, var, wb):
d = frame[var]
w = frame[wb]
return (d * w).sum() / w.sum()
df = pd.read_csv('Sample.csv')
Matrix = df.groupby(['Class']).apply(wtdavg,var='V2',wb='wb')
print(Matrix)
I am a newbie with 1 week of pandas experience. Thanks in advance.
Max
Upvotes: 1
Views: 5363
Reputation: 19957
#use apply to calculate weighted mean for alll 3 columns in one go.
df2 = df.groupby('Class').apply(lambda x: pd.Series([sum(x.V1*x.wb)/sum(x.wb), sum(x.V2*x.wb)/sum(x.wb), sum(x.V3*x.wb)/sum(x.wb)]))
#rename columns
df2.columns=['V1_M','V2_M','V3_M']
df2
Out[858]:
V1_M V2_M V3_M
Class
A 9.526316 9.157895 10.684211
B 3.900000 7.700000 7.900000
C 5.428571 2.857143 3.000000
D 5.631579 3.473684 5.526316
Update (dynamic list of value columns, i.e. var_cols
)
#put all your variable names in a list (can be copied over from df.columns)
var_cols = ['V1', 'V2', 'V3']
df2 = df.groupby('Class').apply(lambda x: pd.Series([sum(x[v] * x.wb) / sum(x.wb) for v in var_cols]))
df2.columns = [e+'_M' for e in var_cols]
V1_M V2_M V3_M
Class
A 9.526316 9.157895 10.684211
B 3.900000 7.700000 7.900000
C 5.428571 2.857143 3.000000
D 5.631579 3.473684 5.526316
Upvotes: 6
Reputation: 863741
More general solutions:
1.It create weighted mean for all columns without Student
, Class
:
df2 = df.drop('Student', axis=1) \
.groupby('Class') \
.apply(lambda x: x.drop(['Class', 'wb'], axis=1).mul(x.wb, 0).sum() / (x.wb).sum()) \
.add_suffix('_M') \
.reset_index()
print (df2)
Class V1_M V2_M V3_M
0 A 9.526316 9.157895 10.684211
1 B 3.900000 7.700000 7.900000
2 C 5.428571 2.857143 3.000000
3 D 5.631579 3.473684 5.526316
Or you can define columns for weighted mean:
df2 = df.groupby('Class') \
.apply(lambda x: x[['V1', 'V2', 'V3']].mul(x.wb, 0).sum() / (x.wb).sum()) \
.add_suffix('_M') \
.reset_index()
print (df2)
Class V1_M V2_M V3_M
0 A 9.526316 9.157895 10.684211
1 B 3.900000 7.700000 7.900000
2 C 5.428571 2.857143 3.000000
3 D 5.631579 3.473684 5.526316
More general is filter all columns starts with V
by filter
:
df2 = df.groupby('Class') \
.apply(lambda x: x.filter(regex='^V').mul(x.wb, 0).sum() / (x.wb).sum()) \
.add_suffix('_M') \
.reset_index()
print (df2)
Class V1_M V2_M V3_M
0 A 9.526316 9.157895 10.684211
1 B 3.900000 7.700000 7.900000
2 C 5.428571 2.857143 3.000000
3 D 5.631579 3.473684 5.526316
Upvotes: 1
Reputation: 944
import pandas as pd
import numpy as np
def wtdavg(frame, var, wb):
d = frame[var]
w = frame[wb]
return (d * w).sum() / w.sum()
df = pd.read_csv('Sample.csv')
temp_df = pd.DataFrame()
for column in df.columns:
if df[column].dtype == np.int64:
temp_S = pd.DataFrame( df[column].groupby(df['Class']).mean())
frames = [temp_df, temp_S]
temp_df = pd.concat(frames, axis = 'columns')
print temp_df
Upvotes: 0