Reputation: 4792
I have a pandas DataFrame with a column of continuous values. I need to split them into 3 bins, such that the first bin covers values below the 20th percentile, the second covers values between the 20th and 80th percentiles, and the last covers values above the 80th percentile.
I am trying to do this by first computing the bin boundaries for those percentiles and then using the pandas cut function. The problem is that I get an odd result: only the middle bin is assigned, and everything else is NaN. Please see below:
import numpy as np
import pandas as pd
test = [x for x in range(0,100)]
a = pd.DataFrame(test)
np.percentile(a, [20, 80])
Out[52]: array([ 19.8, 79.2])
pd.cut(a[0], np.percentile(a[0], [20, 80]))
...
15 NaN
16 NaN
17 NaN
18 NaN
19 NaN
20 (19.8, 79.2]
21 (19.8, 79.2]
22 (19.8, 79.2]
...
78 (19.8, 79.2]
79 (19.8, 79.2]
80 NaN
Why is that so? I thought pandas cut requires you to supply the boundaries of the bins you want. By supplying 2 boundaries I expected to get 3 bins, but it doesn't seem to work this way.
Upvotes: 3
Views: 6737
Reputation: 323276
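If you need 3 bins, then you need 4 breakpoints. pd.cut treats the values you pass as bin edges, so n edges give n-1 bins, and anything outside the outermost edges becomes NaN.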
test = [x for x in range(0,100)]
a = pd.DataFrame(test)
np.percentile(a, [0,20, 80,100])
Out[527]: array([ 0. , 19.8, 79.2, 99. ])
# include_lowest=True keeps the minimum value (0) in the first bin
pd.cut(a[0], np.percentile(a[0], [0,20, 80,100]), include_lowest=True)
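One thing to watch with finite outer edges: pd.cut intervals are open on the left by default, so a value equal to the lowest edge is left unbinned unless include_lowest=True is passed. A minimal sketch of an alternative, assuming you are happy with open-ended outer bins: pad the two percentile edges with -inf and +inf. The labels "low", "mid" and "high" and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(range(100))

# Finite outer edges without include_lowest: the minimum value is not binned.
naive = pd.cut(s, np.percentile(s, [0, 20, 80, 100]))
n_missing_naive = int(naive.isna().sum())

# Two inner edges plus open-ended outer edges give three bins covering everything.
lo, hi = np.percentile(s, [20, 80])
binned = pd.cut(s, [-np.inf, lo, hi, np.inf], labels=["low", "mid", "high"])
counts = binned.value_counts().to_dict()
n_missing = int(binned.isna().sum())
```

With this approach new data that falls outside the original minimum or maximum still lands in the first or last bin instead of turning into NaN.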
Also, pandas has qcut, which computes the quantile-based bin edges for you, so you do not need to get them from numpy:
pd.qcut(a[0],[0,0.2,0.8,1])
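As a quick check of the qcut approach, the sketch below bins the same 0 to 99 data and counts the members of each bin. The label names are an assumption added for readability; qcut returns interval labels if you leave them out.

```python
import pandas as pd

s = pd.Series(range(100))

# Quantiles 0, 0.2, 0.8 and 1 define three bins; qcut includes the minimum value.
binned = pd.qcut(s, [0, 0.2, 0.8, 1], labels=["low", "mid", "high"])
counts = binned.value_counts().to_dict()
n_missing = int(binned.isna().sum())
```

For this data the split should be 20 / 60 / 20 with no NaN values.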
Upvotes: 7