Reputation: 22001
col_a col_b
a 10
a 20
c 10
c 5
d 20
e 30
The total of col_b
is 95. I want to select only those rows where sum of col_b
values exceeds 80% of the total (95). In this case, the sum by each group is
a 30
c 15
d 20
e 30
In this case, we start from the largest and keep including col_a
rows till 80& of 95 i.e. 76 is exceeded. Therefore the end result will be:
col_a col_b
a 10
a 20
d 20
e 30
I.e. we will exclude rows corresponding to c
in col_a
. How do I achieve this using Pandas?
Upvotes: 0
Views: 38
Reputation: 1265
a way to do this is:
(
df
.set_index('col_a')[
df
.groupby('col_a')
.sum()
.sort_values(by='col_b', ascending=False)
.cumsum()
.lt(df.col_b.sum()*0.8)
.shift(fill_value=True)
]
.dropna()
.reset_index()
)
or the same in one long line:
df.set_index('col_a')[df.groupby('col_a').sum().sort_values(by='col_b', ascending=False).cumsum().lt(df.col_b.sum()*0.8).shift(fill_value=True)].dropna().reset_index()
Output:
col_a col_b
0 a 10.0
1 a 20.0
2 d 20.0
3 e 30.0
Upvotes: 1