Reputation: 41
I am looking for help gathering the top 95 percent of sales in a Pandas Data frame where I need to group by a category column. I found the following (top section of code) which is close. TotalDollars
in my df gets properly sorted in descending fashion, but the resulting number of rows includes more than top 95% of total dollars.
Total Dollars Percent Running Percent
117388 11.09% 11.09%
81632 7.71% 18.80%
46316 4.38% 23.18%
41500 3.92% 27.10%
after hitting 95% running total percent want to eliminate remaining rows for that category. I don't need Percent or Running Percent as df fields (given for illustrative purposes only).
df1 = (df.groupby('channel',group_keys=False)
.apply(lambda x: x.nlargest(int(len(x) * a), 'score')))
my code:
df_out = (df_Sales.groupby('category', group_keys=False).apply(lambda x: x.nlargest(int(len(x) * 0.95), 'TotalDollars')))
Upvotes: 0
Views: 3338
Reputation: 1725
import pandas as pd
import numpy as np
np.random.seed(100)
test_df = pd.DataFrame({
'group': ['A'] * 5 + ['B'] * 5,
'value': np.random.randint(1,100,10)
})
def retain_quantile(df, percentile=0.95):
percentile_val = df['value'].quantile(percentile)
return df[df['value'] <= percentile_val]
grouped_df = test_df.groupby('group').apply(retain_quantile)
grouped_df
group value
group
A 0 A 9
1 A 25
2 A 68
4 A 80
B 5 B 49
6 B 11
7 B 95
8 B 53
if you're planning on using this for multiple columns, it would be a lot more complicated but the approach is very similar.
Upvotes: 1