Reputation: 8247
I have following dataframe in pandas
key time outlier
1_2 4 False
1_2 2 True
1_2 2 True
1_2 5 True
1_2 6 False
1_3 10 False
1_3 12 False
1_3 10 True
1_3 20 True
I want to calculate mean, median and count at every key
level where outlier
is True
. My desired dataframe is as follows
key time outlier outlier_avg outlier_median outlier_count
1_2 4 False 3 2 3
1_2 2 True 3 2 3
1_2 2 True 3 2 3
1_2 5 True 3 2 3
1_2 6 False 3 2 3
1_3 10 False 15 15 2
1_3 12 False 15 15 2
1_3 10 True 15 15 2
1_3 20 True 15 15 2
Mean, median and count will be duplicated across key which is fine. I am doing following thing in pandas
data.groupby('key').filter(lambda x : x['outlier'] == True).agg({'time':['mean','median', 'count'])
Upvotes: 1
Views: 236
Reputation: 862691
First filter by boolean indexing
, then aggregate, rename columns names by DataFrame.add_prefix
and last DataFrame.join
to original:
df = (data.join(data[data['outlier']].groupby('key')['time'].agg(['mean','median', 'count'])
.add_prefix('outlier_'), on='key'))
print (df)
key time outlier outlier_mean outlier_median outlier_count
0 1_2 4 False 3 2 3
1 1_2 2 True 3 2 3
2 1_2 2 True 3 2 3
3 1_2 5 True 3 2 3
4 1_2 6 False 3 2 3
5 1_3 10 False 15 15 2
6 1_3 12 False 15 15 2
7 1_3 10 True 15 15 2
8 1_3 20 True 15 15 2
EDIT: Solution for processing True
s and False
s columns separately:
df = data.groupby(['key','outlier'])['time'].agg(['mean','median', 'count']).unstack()
df.columns = df.columns.map(lambda x: f'outlier_{x[0]}_{x[1]}')
df = data.join(df, on='key')
print (df)
key time outlier outlier_mean_False outlier_mean_True \
0 1_2 4 False 5 3
1 1_2 2 True 5 3
2 1_2 2 True 5 3
3 1_2 5 True 5 3
4 1_2 6 False 5 3
5 1_3 10 False 11 15
6 1_3 12 False 11 15
7 1_3 10 True 11 15
8 1_3 20 True 11 15
outlier_median_False outlier_median_True outlier_count_False \
0 5 2 2
1 5 2 2
2 5 2 2
3 5 2 2
4 5 2 2
5 11 15 2
6 11 15 2
7 11 15 2
8 11 15 2
outlier_count_True
0 3
1 3
2 3
3 3
4 3
5 2
6 2
7 2
Upvotes: 1