Carl

Reputation: 608

pandas qcut not putting equal number of observations into each bin

I have a data frame, from which I can select a column (series) as follows:

df:

            value_rank
275488          90
275490          35
275491          60
275492          23
275493          23
275494          34
275495          75
275496          40
275497          69
275498          14
275499          83
...             ...

value_rank is a previously created percentile rank from a larger data set. What I am trying to do is create bins of this data, e.g. quintiles:

pd.qcut(df.value_rank, 5, labels=False)


275488    4
275490    1
275491    3
275492    1
275493    1
275494    1
275495    3
275496    2
...      ...

This appears fine, as expected, but it isn't.

In fact, I have 1569 rows. The nearest number divisible by 5 is 1565, which would give 1565 / 5 = 313 observations in each bin. Since there are 4 extra records, I would expect 4 bins with 314 observations and one with 313. Instead, I get this:

obs = pd.qcut(df.value_rank, 5, labels=False)
obs.value_counts()

0    329
3    314
1    313
4    311
2    302

I have no NaNs in df, and cannot think of any reason why this is happening. I'm literally beginning to tear my hair out!

Here is a small example:

df:

            value_rank
286742              11
286835              53
286865              40
286930              31
286936              45
286955              27
287031              30
287111              36
287269              30
287310              18

pd.qcut gives this (bin label on the left, count on the right):

pd.qcut(df.value_rank, 5, labels=False).value_counts()

1    3
4    2
3    2
0    2
2    1

There should be 2 observations in each bin, not 3 in bin 1 and 1 in bin 2!
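For reference, a self-contained snippet that reproduces the small example above (same values and index):

import pandas as pd

df = pd.DataFrame(
    {'value_rank': [11, 53, 40, 31, 45, 27, 30, 36, 30, 18]},
    index=[286742, 286835, 286865, 286930, 286936,
           286955, 287031, 287111, 287269, 287310],
)
pd.qcut(df.value_rank, 5, labels=False).value_counts()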

Upvotes: 15

Views: 17142

Answers (4)

Robert Rodkey

Reputation: 423

qcut is trying to compensate for repeating values. This is easier to visualize if you return the bin limits along with your qcut results:

In [42]: test_list = [ 11, 18, 27, 30, 30, 31, 36, 40, 45, 53 ]
In [43]: test_series = pd.Series(test_list, name='value_rank')

In [49]: pd.qcut(test_series, 5, retbins=True, labels=False)
Out[49]:
(array([0, 0, 1, 1, 1, 2, 3, 3, 4, 4]),
 array([ 11. ,  25.2,  30. ,  33. ,  41. ,  53. ]))

You can see that there was no choice but to set a bin limit at 30, so qcut had to "steal" one of the expected values from the third bin and place it in the second. I think this is just happening at a larger scale with your percentiles, since you're basically condensing their ranks into a 1 to 100 scale. Is there any reason not to run qcut directly on the data instead of on the percentiles, or to return percentiles with greater precision?
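To see where those edges come from, you can compute them yourself: for 5 bins, qcut places its edges at the 0/20/40/60/80/100th percentiles. A minimal sketch with NumPy, whose default linear interpolation matches the edges above:

import numpy as np

test_list = [11, 18, 27, 30, 30, 31, 36, 40, 45, 53]

# the duplicated 30s force the 40th-percentile edge to land exactly on 30,
# so the second bin necessarily absorbs both of them
print(np.percentile(test_list, [0, 20, 40, 60, 80, 100]))
# [11.  25.2 30.  33.  41.  53. ]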

Upvotes: 14

fulowa

Reputation: 11

This problem arises from duplicated values. A possible solution to force equal-sized bins is to use the index as the input for pd.qcut after sorting the dataframe:

import random
import pandas as pd

# build a small frame of repeated integers, sort it, and expose the
# sorted row positions as an 'index' column to cut on later
df = pd.DataFrame({'A': [random.randint(3, 9) for x in range(20)]}).sort_values('A').reset_index()
del df['index']
df = df.reset_index()
df['A'].plot.hist(bins=30);

picture: https://i.sstatic.net/ztjzn.png

df.head()
df['qcut_v1'] = pd.qcut(df['A'], q=4)
df['qcut_v2'] = pd.qcut(df['index'], q=4)
df

picture: https://i.sstatic.net/RB4TN.png

df.groupby('qcut_v1').count().reset_index()

picture: https://i.sstatic.net/IKtsW.png

df.groupby('qcut_v2').count().reset_index()

picture: https://i.sstatic.net/4jrkU.png

Sorry, I cannot post the images inline since I don't have at least 10 reputation on Stack Overflow -.-
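Since the images may not render, continuing from the df built above, the difference also shows up directly in the bin counts (with 20 rows and q=4, the index-based cut is guaranteed 5 rows per bin, while the value-based cut varies with the random draw):

print(df['qcut_v1'].value_counts())  # counts vary when values repeat
print(df['qcut_v2'].value_counts())  # exactly 5 rows in each of the 4 bins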

Upvotes: 1

AnksG

Reputation: 546

Just try the code below, where df is the series you want to bin and nbins is the desired number of bins. rank(method='first') assigns distinct ranks to tied values, so qcut can always form equal-sized bins:

pd.qcut(df.rank(method='first'), nbins)
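Applied to the small example from the question, a minimal sketch (labels=False is added here just to make the counts easy to read):

import pandas as pd

s = pd.Series([11, 53, 40, 31, 45, 27, 30, 36, 30, 18], name='value_rank')

# the tied 30s get distinct ranks (4 and 5), so the ranks 1..10 cut cleanly
binned = pd.qcut(s.rank(method='first'), 5, labels=False)
print(binned.value_counts().sort_index())
# each of the 5 bins now holds exactly 2 observations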

Upvotes: 13

scottlittle

Reputation: 20912

If you must get equal (or nearly equal) bins, then here's a trick you can use with qcut. Using the same data as the accepted answer, we can force these into equal bins by adding some random noise to the original test_list and binning according to those values.

test_list = [ 11, 18, 27, 30, 30, 31, 36, 40, 45, 53 ]

np.random.seed(42) #set this for reproducible results
test_list_rnd = np.array(test_list) + np.random.random(len(test_list)) #add noise to data

test_series = pd.Series(test_list_rnd, name='value_rank')
pd.qcut(test_series, 5, retbins=True, labels=False)

Output:

(0    0
 1    0
 2    1
 3    2
 4    1
 5    2
 6    3
 7    3
 8    4
 9    4
 Name: value_rank, dtype: int64,
 array([ 11.37454012,  25.97573801,  30.42160255,  33.11683016,
         41.81316392,  53.70807258]))

So now we have two of each label: two 0's, two 1's, two 2's, two 3's and two 4's!

Disclaimer

Obviously, use this at your discretion, because results can vary based on your data, for instance how large your data set is and how the values are spaced. The above "trick" works well for integers: even though we are "salting" the test_list, it will still rank-order correctly, in the sense that there won't be a value in group 0 greater than a value in group 1 (maybe equal, but not greater). If, however, you have floats, this may be trickier and you may have to reduce the size of the noise accordingly. For instance, if you had floats like 2.1, 5.3, 5.3, 5.4, etc., you should reduce the noise by dividing by 10: np.random.random(len(test_list)) / 10. If you have arbitrarily long floats, however, you probably would not have this problem in the first place, given the noise already present in "real" data.
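A minimal sketch of that scaling (the float values and the /10 factor come from the paragraph above; the key requirement is that the noise stays strictly below the smallest nonzero gap between values, 0.1 here, so the salted data keeps the original ordering):

import numpy as np
import pandas as pd

float_list = [2.1, 5.3, 5.3, 5.4, 6.0, 7.2]  # ties at 5.3

np.random.seed(42)  # for reproducible noise
# noise drawn from [0, 0.1), strictly below the 0.1 gap between 5.3 and 5.4
salted = np.array(float_list) + np.random.random(len(float_list)) / 10

# 3 bins of exactly 2 observations each, with ties broken by the noise
print(pd.qcut(pd.Series(salted), 3, labels=False).value_counts())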

Upvotes: 4
