Selecting groups in pandas dataframe based on percentage of total

Question

col_a   col_b
a        10
a        20
c        10
c        5
d        20
e        30

The total of col_b is 95. I want to select only those rows where sum of col_b values exceeds 80% of the total (95). In this case, the sum by each group is

a 30
c 15
d 20
e 30

In this case, we start from the largest and keep including col_a rows till 80& of 95 i.e. 76 is exceeded. Therefore the end result will be:

col_a   col_b
a        10
a        20
d        20
e        30

I.e. we will exclude rows corresponding to c in col_a. How do I achieve this using Pandas?

99_m4n · Accepted Answer

a way to do this is:

(
    df
        .set_index('col_a')[
            df
                .groupby('col_a')
                .sum()
                .sort_values(by='col_b', ascending=False)
                .cumsum()
                .lt(df.col_b.sum()*0.8)
                .shift(fill_value=True)
        ]
        .dropna()
        .reset_index()
)

or the same in one long line:

df.set_index('col_a')[df.groupby('col_a').sum().sort_values(by='col_b', ascending=False).cumsum().lt(df.col_b.sum()*0.8).shift(fill_value=True)].dropna().reset_index()

Output:

  col_a  col_b
0     a   10.0
1     a   20.0
2     d   20.0
3     e   30.0

Selecting groups in pandas dataframe based on percentage of total

Answers (1)

Related Questions