Reputation: 521
I have the following dataframe:
import pandas as pd

d = {'group': ['a', 'b', 'c', 'b', 'b', 'c', 'a', 'b', 'a'],
     'cum_sum': [1, 4, 3, 9, 15, 6, 3, 17, 4]}
df = pd.DataFrame(data=d)
I want to filter the whole dataframe so that, within each group, only the records whose cum_sum is less than 0.8 times that group's maximum cum_sum are kept.
I tried playing around with lambda parameters and filter:
grouped = df.groupby('group')
grouped.filter(lambda x: x[x.cum_sum] <= x[x.cum_sum.max()])
But I simply have no idea where to actually start... Any ideas?
UPDATED ACCORDING TO COMMENTS!
Upvotes: 1
Views: 865
Reputation: 168
Not the most elegant solution, but it works:
grouped = df.groupby('group')['cum_sum'].max().reset_index()
grouped.columns = ['group', 'max_cum_sum']
df = df.merge(grouped)
df = df.loc[df['cum_sum'] <= df['max_cum_sum'] * 0.8]
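Putting it together on the sample data as a self-contained sketch (the final drop of the helper column is an addition for illustration, so the result has the original columns again):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'group': ['a', 'b', 'c', 'b', 'b', 'c', 'a', 'b', 'a'],
                   'cum_sum': [1, 4, 3, 9, 15, 6, 3, 17, 4]})

# Per-group maximum as a helper column, merged back onto every row
grouped = df.groupby('group')['cum_sum'].max().reset_index()
grouped.columns = ['group', 'max_cum_sum']
out = df.merge(grouped)
out = out.loc[out['cum_sum'] <= out['max_cum_sum'] * 0.8]

# The helper column is no longer needed after filtering
out = out.drop(columns='max_cum_sum')
print(out)
```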
Upvotes: 0
Reputation: 863166
You can use boolean indexing:
df1 = df[df['cum_sum'].lt(df.groupby('group')['cum_sum'].transform('max') * 0.8)]
print (df1)
group cum_sum
0 a 1
1 b 4
2 c 3
3 b 9
6 a 3
Explanation:
First use GroupBy.transform with 'max' to get a Series the same size as the original DataFrame:
print (df.groupby('group')['cum_sum'].transform('max'))
0 4
1 17
2 6
3 17
4 17
5 6
6 4
7 17
8 4
Name: cum_sum, dtype: int64
Multiply by the constant 0.8:
print (df.groupby('group')['cum_sum'].transform('max') * 0.8)
0 3.2
1 13.6
2 4.8
3 13.6
4 13.6
5 4.8
6 3.2
7 13.6
8 3.2
Name: cum_sum, dtype: float64
Compare with Series.lt (i.e. <):
print (df['cum_sum'].lt(df.groupby('group')['cum_sum'].transform('max') * 0.8))
0 True
1 True
2 True
3 True
4 False
5 False
6 True
7 False
8 False
Name: cum_sum, dtype: bool
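For completeness: GroupBy.filter from the question keeps or drops whole groups, which is why the lambda attempt cannot remove individual rows. A per-row version of the same idea can be sketched with GroupBy.apply (the transform-based boolean indexing above is faster, since it avoids calling a Python function per group):

```python
import pandas as pd

df = pd.DataFrame({'group': ['a', 'b', 'c', 'b', 'b', 'c', 'a', 'b', 'a'],
                   'cum_sum': [1, 4, 3, 9, 15, 6, 3, 17, 4]})

# Filter rows inside each group against that group's own maximum;
# group_keys=False keeps the original index instead of adding group labels
df1 = (df.groupby('group', group_keys=False)
         .apply(lambda g: g[g['cum_sum'] < g['cum_sum'].max() * 0.8])
         .sort_index())
print(df1)
```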
Upvotes: 1