roberto tomás

Reputation: 4687

qcut with non-unique bin edges produces wrong number of quantiles

I have this simple method to get my deciles:

import pandas as pd

def output_deciles(model, X, y, order='predictions'):
    # Gather predictions and actuals, then bin rows by the chosen column.
    results = pd.DataFrame(model.predict(X), index=X.index, columns=['predictions'])
    results['actual'] = y
    results['deciles'] = pd.qcut(results[order], 10, labels=False, duplicates='drop')
    return results

Everything works fine if I compute deciles from my predictions:

out = output_deciles(pipeline, X, y)
out.groupby('deciles')[['actual', 'predictions']].mean()

[image: groupby output from predictions]

This is on some 9,400 records.

But if I try to get deciles of my actual values, I only get 7 bins instead of 10. This is because about half of the values in this target are 0:

out = output_deciles(pipeline, X, y, order='actual')
out.groupby('deciles')[['actual', 'predictions']].mean()

[image: groupby output from actual]

This is despite the column having a large number of unique values:

print(len(out['actual'].unique()))

4593

That is counterintuitive: it's almost as if qcut is dropping whole bins rather than just the duplicate edge values. And if I change the duplicates setting to 'raise', it throws:

ValueError: Bin edges must be unique: array([-4.60517019, 0., 0., 0., 0., 3.47630251, 8.40045698, 10.11776099, 11.46706716, 12.86027487, 17.7007044])
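(For anyone reproducing this, a minimal sketch with synthetic data, not my actual model output, shows the same collapse: with half the values sitting at a single point, several interior decile edges coincide at 0, and duplicates='drop' merges those bins, so fewer than 10 labels survive.)

import numpy as np
import pandas as pd

np.random.seed(0)
# Half exact zeros, half positive draws: interior decile edges all land on 0.
vals = pd.Series(np.concatenate([np.zeros(500), np.random.exponential(5, 500)]))
deciles = pd.qcut(vals, 10, labels=False, duplicates='drop')
print(deciles.nunique())  # fewer than 10: whole bins were merged, not single values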

How do I get deciles, when I mean deciles, given non-unique bin edges?

Upvotes: 0

Views: 264

Answers (1)

ALollz

Reputation: 59579

What you might be looking for is a way to construct the "quantiles" yourself. You can do this by sorting and then using integer division to define the group.

I'll create data with an excessive mass at 0, such that pd.qcut will complain about duplicates.

import pandas as pd
import numpy as np

np.random.seed(410012)
# 1000 normal draws plus 500 exact zeros: a heavy point mass at 0.
s = pd.Series(np.random.normal(0, 4, 1000))
s = pd.concat([s, pd.Series([0]*500)])
s = s.to_frame('vals')

N = 10
s = s.sort_values('vals')
# Sorted row position, integer-divided by the bin size, gives equal-count groups.
s['q'] = np.arange(len(s)) // (len(s)/N)

With q we now get 10 bins regardless.

s.groupby('q').describe()
#      vals                                                          
#     count    mean     std      min     25%     50%     75%      max
#q                                                                   
#0.0  150.0 -6.5934  1.9208 -12.6041 -7.7703 -6.1546 -5.1073  -4.3421
#1.0  150.0 -3.1922  0.5621  -4.3287 -3.6605 -3.1293 -2.7377  -2.2718
#2.0  150.0 -1.4932  0.4203  -2.2561 -1.8196 -1.5262 -1.1364  -0.7451
#3.0  150.0 -0.1831  0.2400  -0.7425 -0.3371 -0.0110  0.0000   0.0000
#4.0  150.0  0.0000  0.0000   0.0000  0.0000  0.0000  0.0000   0.0000
#5.0  150.0  0.0000  0.0000   0.0000  0.0000  0.0000  0.0000   0.0000
#6.0  150.0  0.0238  0.0678   0.0000  0.0000  0.0000  0.0000   0.2856
#7.0  150.0  1.1555  0.4833   0.3353  0.7615  1.1837  1.5819   1.9513
#8.0  150.0  2.9430  0.6016   1.9660  2.4385  2.9665  3.4764   4.0277
#9.0  150.0  6.1692  1.6616   4.0336  4.8805  5.8176  6.9019  12.3437

The bins that don't overlap the problematic value are identical, but the two bins where 0 is an edge differ, because qcut has collapsed them:

s.groupby(pd.qcut(s['vals'], 10, duplicates='drop'))['vals'].describe()
#                              count    mean     std      min     25%     50%     75%      max
#vals                                                                                         
#(-12.604999999999999, -4.33]  150.0 -6.5934  1.9208 -12.6041 -7.7703 -6.1546 -5.1073  -4.3421
#(-4.33, -2.259]               150.0 -3.1922  0.5621  -4.3287 -3.6605 -3.1293 -2.7377  -2.2718
#(-2.259, -0.743]              150.0 -1.4932  0.4203  -2.2561 -1.8196 -1.5262 -1.1364  -0.7451
#(-0.743, 0.0]                 576.0 -0.0477  0.1463  -0.7425  0.0000  0.0000  0.0000   0.0000
#(0.0, 0.301]                   24.0  0.1490  0.1016   0.0024  0.0457  0.1497  0.2485   0.2856
#(0.301, 1.954]                150.0  1.1555  0.4833   0.3353  0.7615  1.1837  1.5819   1.9513
#(1.954, 4.028]                150.0  2.9430  0.6016   1.9660  2.4385  2.9665  3.4764   4.0277
#(4.028, 12.344]               150.0  6.1692  1.6616   4.0336  4.8805  5.8176  6.9019  12.3437
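As an aside (a sketch added here, not part of the original answer): the same equal-count bins can be had without the explicit sort by ranking first. rank(method='first') breaks ties, so the ranks are unique and qcut always finds N distinct edges. Continuing from s and N above:

# Ranks are unique even where vals ties at 0, so no edges collapse.
s['q2'] = pd.qcut(s['vals'].rank(method='first'), N, labels=False)
print(s.groupby('q2').size())  # 150 rows in every one of the 10 bins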

Upvotes: 1
