Reputation: 4687
I have this simple method to get my deciles:
import pandas as pd

def output_deciles(model, X, y, order='predictions'):
    # Collect predictions and actuals in one frame, indexed like X.
    results = pd.DataFrame(model.predict(X), index=X.index, columns=['predictions'])
    results['actual'] = y
    # Bin into deciles on the chosen column.
    results['deciles'] = pd.qcut(results[order], 10, labels=False, duplicates='drop')
    return results
Everything works fine if I use it on deciles from my predictions:
out = output_deciles(pipeline, X, y)
out.groupby('deciles')[['actual', 'predictions']].mean()
This is on some 9,400 records.
But if I try to get deciles of my actual values, I only get 7 instead of 10 deciles. This is because about half of the values in this target are 0:
out = output_deciles(pipeline, X, y, order='actual')
out.groupby('deciles')[['actual', 'predictions']].mean()
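A quick check confirms that whole bins are missing, not just a few labels (using the same out as above; this should report 7 rather than 10):

print(out['deciles'].nunique())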
This is despite the column having a large number of unique values:
print(len(out['actual'].unique()))
4593
That is counterintuitive: it is almost as if qcut is dropping whole bins instead of just the duplicate values. And if I change the duplicates setting to 'raise', it throws:
ValueError: Bin edges must be unique: array([-4.60517019, 0. , 0. , 0. , 0. , 3.47630251, 8.40045698, 10.11776099, 11.46706716, 12.86027487, 17.7007044 ]).
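The collapsing is visible in the cut points themselves. Computing the decile edges of the target directly (same out as above) shows several consecutive edges landing on 0, matching the array in the error:

import numpy as np
print(out['actual'].quantile(np.linspace(0, 1, 11)))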
How do I get deciles, when I mean deciles, given non-unique bin edges?
Upvotes: 0
Views: 264
Reputation: 59579
What you might be looking for is a way to construct the "quantiles" yourself. You can do this by sorting and then using integer division to define the group.
I'll create data with an excessive mass at 0, such that pd.qcut
will complain about duplicates.
import pandas as pd
import numpy as np

np.random.seed(410012)

# 1000 normal draws plus 500 exact zeros: an excessive mass at a single value.
s = pd.Series(np.random.normal(0, 4, 1000))
s = pd.concat([s, pd.Series([0]*500)])
s = s.to_frame('vals')

N = 10
# Sort, then assign each row to a group by its position,
# so every group gets len(s)/N rows.
s = s.sort_values('vals')
s['q'] = np.arange(len(s)) // (len(s)/N)
With q we now get 10 bins regardless of the duplicated values.
s.groupby('q').describe()
# vals
# count mean std min 25% 50% 75% max
#q
#0.0 150.0 -6.5934 1.9208 -12.6041 -7.7703 -6.1546 -5.1073 -4.3421
#1.0 150.0 -3.1922 0.5621 -4.3287 -3.6605 -3.1293 -2.7377 -2.2718
#2.0 150.0 -1.4932 0.4203 -2.2561 -1.8196 -1.5262 -1.1364 -0.7451
#3.0 150.0 -0.1831 0.2400 -0.7425 -0.3371 -0.0110 0.0000 0.0000
#4.0 150.0 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
#5.0 150.0 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
#6.0 150.0 0.0238 0.0678 0.0000 0.0000 0.0000 0.0000 0.2856
#7.0 150.0 1.1555 0.4833 0.3353 0.7615 1.1837 1.5819 1.9513
#8.0 150.0 2.9430 0.6016 1.9660 2.4385 2.9665 3.4764 4.0277
#9.0 150.0 6.1692 1.6616 4.0336 4.8805 5.8176 6.9019 12.3437
The bins that don't overlap the problematic value are identical, but the bins where 0 is an edge are different, because qcut has collapsed the duplicate edges:
s.groupby(pd.qcut(s['vals'], 10, duplicates='drop'))['vals'].describe()
# count mean std min 25% 50% 75% max
#vals
#(-12.604999999999999, -4.33] 150.0 -6.5934 1.9208 -12.6041 -7.7703 -6.1546 -5.1073 -4.3421
#(-4.33, -2.259] 150.0 -3.1922 0.5621 -4.3287 -3.6605 -3.1293 -2.7377 -2.2718
#(-2.259, -0.743] 150.0 -1.4932 0.4203 -2.2561 -1.8196 -1.5262 -1.1364 -0.7451
#(-0.743, 0.0] 576.0 -0.0477 0.1463 -0.7425 0.0000 0.0000 0.0000 0.0000
#(0.0, 0.301] 24.0 0.1490 0.1016 0.0024 0.0457 0.1497 0.2485 0.2856
#(0.301, 1.954] 150.0 1.1555 0.4833 0.3353 0.7615 1.1837 1.5819 1.9513
#(1.954, 4.028] 150.0 2.9430 0.6016 1.9660 2.4385 2.9665 3.4764 4.0277
#(4.028, 12.344] 150.0 6.1692 1.6616 4.0336 4.8805 5.8176 6.9019 12.3437
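If it helps, here is a sketch of how the same sort-and-integer-divide idea could be folded back into the question's output_deciles helper (the name output_deciles_forced and the n_bins parameter are mine, and the returned frame comes back sorted by order rather than by the original index):

import numpy as np
import pandas as pd

def output_deciles_forced(model, X, y, order='predictions', n_bins=10):
    results = pd.DataFrame(model.predict(X), index=X.index, columns=['predictions'])
    results['actual'] = y
    # Sort by the chosen column and assign equal-sized positional groups,
    # so ties (e.g. a large mass at 0) can never collapse a bin.
    results = results.sort_values(order, kind='mergesort')
    results['deciles'] = (np.arange(len(results)) // (len(results) / n_bins)).astype(int)
    return results

The trade-off is that observations tied at a boundary value are split across adjacent bins by position, whereas pd.qcut keeps ties together and merges bins instead.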
Upvotes: 1