Reputation: 4687
I have this simple method to get my deciles:
import pandas as pd

def output_deciles(model, X, y, order='predictions'):
    # Collect predictions and actuals in one frame, indexed like X.
    results = pd.DataFrame(model.predict(X), index=X.index, columns=['predictions'])
    results['actual'] = y
    # Bin into deciles on the chosen column.
    results['deciles'] = pd.qcut(results[order], 10, labels=False, duplicates='drop')
    return results
Everything works fine if I use it on deciles from my predictions:
out = output_deciles(pipeline, X, y)
out.groupby('deciles')[['actual', 'predictions']].mean()
This is on some 9,400 records.
But if I try to get deciles of my actual values, I only get 7 instead of 10 deciles. This is because about half of the values in this target are 0:
out = output_deciles(pipeline, X, y, order='actual')
out.groupby('deciles')[['actual', 'predictions']].mean()
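A quick check confirms that whole bins are missing, not just a few labels (using the same out as above; this should report 7 rather than 10):

print(out['deciles'].nunique())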
This is despite the column having a large number of unique values:
print(len(out['actual'].unique()))
4593
That is counterintuitive: it is almost as if qcut is dropping whole bins instead of just the duplicate values. And if I change the duplicates setting to 'raise', it throws:
ValueError: Bin edges must be unique: array([-4.60517019, 0. , 0. , 0. , 0. , 3.47630251, 8.40045698, 10.11776099, 11.46706716, 12.86027487, 17.7007044 ]).
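The collapsing is visible in the cut points themselves. Computing the decile edges of the target directly (same out as above) shows several consecutive edges landing on 0, matching the array in the error:

import numpy as np
print(out['actual'].quantile(np.linspace(0, 1, 11)))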
How do I get deciles, when I mean deciles, given non-unique bin edges?
Upvotes: 0
Views: 264
Reputation: 59579
What you might be looking for is a way to construct the "quantiles" yourself. You can do this by sorting and then using integer division to define the group.
I'll create data with an excessive mass at 0, such that pd.qcut
will complain about duplicates.
import pandas as pd
import numpy as np

np.random.seed(410012)

# 1000 normal draws plus 500 exact zeros: an excessive mass at a single value.
s = pd.Series(np.random.normal(0, 4, 1000))
s = pd.concat([s, pd.Series([0]*500)])
s = s.to_frame('vals')

N = 10
# Sort, then assign each row to a group by its position,
# so every group gets len(s)/N rows.
s = s.sort_values('vals')
s['q'] = np.arange(len(s)) // (len(s)/N)
With q we now get 10 bins regardless of the duplicated values.
s.groupby('q').describe()
# vals
# count mean std min 25% 50% 75% max
#q
#0.0 150.0 -6.5934 1.9208 -12.6041 -7.7703 -6.1546 -5.1073 -4.3421
#1.0 150.0 -3.1922 0.5621 -4.3287 -3.6605 -3.1293 -2.7377 -2.2718
#2.0 150.0 -1.4932 0.4203 -2.2561 -1.8196 -1.5262 -1.1364 -0.7451
#3.0 150.0 -0.1831 0.2400 -0.7425 -0.3371 -0.0110 0.0000 0.0000
#4.0 150.0 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
#5.0 150.0 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
#6.0 150.0 0.0238 0.0678 0.0000 0.0000 0.0000 0.0000 0.2856
#7.0 150.0 1.1555 0.4833 0.3353 0.7615 1.1837 1.5819 1.9513
#8.0 150.0 2.9430 0.6016 1.9660 2.4385 2.9665 3.4764 4.0277
#9.0 150.0 6.1692 1.6616 4.0336 4.8805 5.8176 6.9019 12.3437
The bins that don't overlap the problematic value are identical, but the bins where 0 is an edge are different, because qcut has collapsed the duplicate edges:
s.groupby(pd.qcut(s['vals'], 10, duplicates='drop'))['vals'].describe()
# count mean std min 25% 50% 75% max
#vals
#(-12.604999999999999, -4.33] 150.0 -6.5934 1.9208 -12.6041 -7.7703 -6.1546 -5.1073 -4.3421
#(-4.33, -2.259] 150.0 -3.1922 0.5621 -4.3287 -3.6605 -3.1293 -2.7377 -2.2718
#(-2.259, -0.743] 150.0 -1.4932 0.4203 -2.2561 -1.8196 -1.5262 -1.1364 -0.7451
#(-0.743, 0.0] 576.0 -0.0477 0.1463 -0.7425 0.0000 0.0000 0.0000 0.0000
#(0.0, 0.301] 24.0 0.1490 0.1016 0.0024 0.0457 0.1497 0.2485 0.2856
#(0.301, 1.954] 150.0 1.1555 0.4833 0.3353 0.7615 1.1837 1.5819 1.9513
#(1.954, 4.028] 150.0 2.9430 0.6016 1.9660 2.4385 2.9665 3.4764 4.0277
#(4.028, 12.344] 150.0 6.1692 1.6616 4.0336 4.8805 5.8176 6.9019 12.3437
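If it helps, here is a sketch of how the same sort-and-integer-divide idea could be folded back into the question's output_deciles helper (the name output_deciles_forced and the n_bins parameter are mine, and the returned frame comes back sorted by order rather than by the original index):

import numpy as np
import pandas as pd

def output_deciles_forced(model, X, y, order='predictions', n_bins=10):
    results = pd.DataFrame(model.predict(X), index=X.index, columns=['predictions'])
    results['actual'] = y
    # Sort by the chosen column and assign equal-sized positional groups,
    # so ties (e.g. a large mass at 0) can never collapse a bin.
    results = results.sort_values(order, kind='mergesort')
    results['deciles'] = (np.arange(len(results)) // (len(results) / n_bins)).astype(int)
    return results

The trade-off is that observations tied at a boundary value are split across adjacent bins by position, whereas pd.qcut keeps ties together and merges bins instead.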
Upvotes: 1