Sean
Sean

Reputation: 41

Code to find top 95 percent of column values in dataframe

I am looking for help gathering the top 95 percent of sales in a Pandas Data frame where I need to group by a category column. I found the following (top section of code) which is close. TotalDollars in my df gets properly sorted in descending fashion, but the resulting number of rows includes more than top 95% of total dollars.

Total Dollars     Percent     Running Percent

117388     11.09%    11.09%

81632     7.71%     18.80%

46316     4.38%     23.18%

41500     3.92%     27.10%

after hitting 95% running total percent want to eliminate remaining rows for that category. I don't need Percent or Running Percent as df fields (given for illustrative purposes only).

df1 = (df.groupby('channel',group_keys=False)
        .apply(lambda x: x.nlargest(int(len(x) * a), 'score')))

my code:

df_out = (df_Sales.groupby('category', group_keys=False).apply(lambda x: x.nlargest(int(len(x) * 0.95), 'TotalDollars')))

Upvotes: 0

Views: 3338

Answers (1)

zero
zero

Reputation: 1725

import pandas as pd
import numpy as np


np.random.seed(100)

test_df = pd.DataFrame({
    'group': ['A'] * 5 + ['B'] * 5,
    'value': np.random.randint(1,100,10)
})


def retain_quantile(df, percentile=0.95):
    percentile_val = df['value'].quantile(percentile)

    return df[df['value'] <= percentile_val]

grouped_df = test_df.groupby('group').apply(retain_quantile)
grouped_df

        group  value
group
A     0     A      9
      1     A     25
      2     A     68
      4     A     80
B     5     B     49
      6     B     11
      7     B     95
      8     B     53

if you're planning on using this for multiple columns, it would be a lot more complicated but the approach is very similar.

Upvotes: 1

Related Questions