Reputation: 7448
I have the following df
,
year code col1 col2
2019 1 2 3
2019 1 3 5
2019 1 2 4
2018 2 1 4
2018 2 2 6
I want to groupby
df
by year
and code
, then sum the differences between col2
and col1
, and then averages the sum over the group size;
df.apply(lambda row: (row['col_2'] - row['col_1']).mean(level=[0, 1]).reset_index(name='avg_num')
this code seems to calculate the mean of differences rather than summing the differences and divided by the group size, so how to fix this?
year code col1 col2 avg_num
2019 1 2 3 1.66
2019 1 3 5 1.66
2019 1 2 4 1.66
2018 2 1 4 3.5
2018 2 2 6 3.5
Upvotes: 1
Views: 1267
Reputation: 3770
data = df.groupby(['year','code']).apply(lambda x: sum(list(abs(x.col2 - x.col1))) / len(list(abs(x.col2 - x.col1)))).reset_index().rename({0:'avg_num'},axis=1)
df.merge(data, on='year').drop('code_y', axis=1).rename({'code_x':'code'}, axis=1)
Output
year code col1 col2 avg_num
0 2019 1 2 3 1.666667
1 2019 1 3 5 1.666667
2 2019 1 2 4 1.666667
3 2018 2 1 4 3.500000
4 2018 2 2 6 3.500000
Upvotes: 1
Reputation: 862681
Use GroupBy.transform
for new column filled by aggregate values:
df['avg_num'] = (df.assign(avg_num=df.col2 - df.col1)
.groupby(['year', 'code'])['avg_num']
.transform('mean').round(2))
print (df)
year code col1 col2 avg_num
0 2019 1 2 3 1.67
1 2019 1 3 5 1.67
2 2019 1 2 4 1.67
3 2018 2 1 4 3.50
4 2018 2 2 6 3.50
Another solution without assign
, but necessary pass columns to groupby
like df['year']
and df['code']
:
df['avg_num'] = ((df['col2'] - df['col1']).groupby([df['year'], df['code']])
.transform('mean').round(2))
print (df)
year code col1 col2 avg_num
0 2019 1 2 3 1.67
1 2019 1 3 5 1.67
2 2019 1 2 4 1.67
3 2018 2 1 4 3.50
4 2018 2 2 6 3.50
Upvotes: 3
Reputation: 14216
We can create an intermediate table to hold the aggregated values and then join it back to the original DataFrame.
aggs = df.assign(avg_num=df.col2 - df.col1) \
.groupby(['year', 'code'], as_index=False)['avg_num'].mean().round(2)
print(aggs)
year code avg_num
0 2018 2 3.50
1 2019 1 1.67
df.merge(aggs, how='inner', left_on=['year', 'code'], right_on=['year', 'code'])
year code col1 col2 avg_num
0 2019 1 2 3 1.67
1 2019 1 3 5 1.67
2 2019 1 2 4 1.67
3 2018 2 1 4 3.50
4 2018 2 2 6 3.50
Upvotes: 1