Maths12
Maths12

Reputation: 991

How can i split my data in pandas into specified buckets e.g. 40-40-20?

All,

i am trying to split my data into 3 buckets that is 40%, 40% and 20%. How can i do this using pandas? e.g. so you get the bottom lowest 40%, middle 40% and top 20% :

pd.cut(df['count'], 5,labels = ['1','2','3','4','5'],retbins=True)

above splits into 5 quintiles, but i would like to 40:40:20 percentiles defined.

any ideas?

Upvotes: 3

Views: 13984

Answers (2)

jgarber79
jgarber79

Reputation: 46

You are on the right path. From the wording in your question I'm not sure if you want to bin the data based on the range of possible values or the actual distribution of values. I'll show both.

Use pd.cut() for binning data based on the range of possible values. Use pd.qcut() for binning data based on the actual distribution of values.

import pandas as pd
import numpy as np

data = np.random.randint(0, 100, 100)
labels = ['Bottom 40%', 'Middle 40%', 'Top 20%']

# bin data based on range of possible values
df['possible'] = pd.cut(df['count'], [0, 40, 80, 100], labels=labels)

# bin data based on distribution of values
df['distribution'] = pd.qcut(df['count'], [0., .4, .8, 1.], labels=labels)

top20possible = df.loc[df['possible'] == 'Top 20%']
top20distribution = df.loc[df['distribution'] == 'Top 20%']


Upvotes: 3

Corralien
Corralien

Reputation: 120399

Use qcut instead:

df["quantile"] = pd.qcut(df["count"], q=[0, 0.4, 0.8, 1],
                         labels=["lowest", "middle", "top"])
>>> df.sort_values("count")

    count quantile
4       5   lowest
7       7   lowest
6       9   lowest
15      9   lowest
5      17   lowest
1      19   lowest
9      22   lowest
2      25   lowest
16     25   lowest
12     29   middle
19     31   middle
18     32   middle
0      32   middle
10     36   middle
14     38   middle
3      39   middle
8      44      top
11     47      top
17     47      top
13     48      top

To get the bins:

df["quantile"], bins = pd.qcut(df["count"], q=[0, 0.4, 0.8, 1],
                               labels=["lowest", "middle", "top"],
                               retbins=True)

Upvotes: 6

Related Questions