Reputation: 991
All,
i am trying to split my data into 3 buckets that is 40%, 40% and 20%. How can i do this using pandas? e.g. so you get the bottom lowest 40%, middle 40% and top 20% :
pd.cut(df['count'], 5,labels = ['1','2','3','4','5'],retbins=True)
above splits into 5 quintiles, but i would like to 40:40:20 percentiles defined.
any ideas?
Upvotes: 3
Views: 13984
Reputation: 46
You are on the right path. From the wording in your question I'm not sure if you want to bin the data based on the range of possible values or the actual distribution of values. I'll show both.
Use pd.cut() for binning data based on the range of possible values. Use pd.qcut() for binning data based on the actual distribution of values.
import pandas as pd
import numpy as np
data = np.random.randint(0, 100, 100)
labels = ['Bottom 40%', 'Middle 40%', 'Top 20%']
# bin data based on range of possible values
df['possible'] = pd.cut(df['count'], [0, 40, 80, 100], labels=labels)
# bin data based on distribution of values
df['distribution'] = pd.qcut(df['count'], [0., .4, .8, 1.], labels=labels)
top20possible = df.loc[df['possible'] == 'Top 20%']
top20distribution = df.loc[df['distribution'] == 'Top 20%']
Upvotes: 3
Reputation: 120399
Use qcut
instead:
df["quantile"] = pd.qcut(df["count"], q=[0, 0.4, 0.8, 1],
labels=["lowest", "middle", "top"])
>>> df.sort_values("count")
count quantile
4 5 lowest
7 7 lowest
6 9 lowest
15 9 lowest
5 17 lowest
1 19 lowest
9 22 lowest
2 25 lowest
16 25 lowest
12 29 middle
19 31 middle
18 32 middle
0 32 middle
10 36 middle
14 38 middle
3 39 middle
8 44 top
11 47 top
17 47 top
13 48 top
To get the bins:
df["quantile"], bins = pd.qcut(df["count"], q=[0, 0.4, 0.8, 1],
labels=["lowest", "middle", "top"],
retbins=True)
Upvotes: 6