Reputation: 33
New take on a problem from two days ago.
When you have a big set of data, pandas lets you filter things in place while generating statistics, so you don't have to make new dataframes for everything. I suspect there's a nuance to doing this, because doing it two different ways - one the 'smart' way and one the 'explicit' way - gives me very different answers. One is clearly right (the explicit way) and one is clearly wrong (the smart way).
Can someone point me in the direction of what I'm missing?
Here are the details.
I've got 2.5 million rows of data describing the time to failure for a range of different equipment types. Descriptive statistics for two of these are below. (Note: these are actually mining equipment, but I have to sanitise the data, so I've replaced the names.)
                                Count       Min        Mean          Max
CC_CauseLocationEquipmentType
Coffee Machines                204136  0.000556   71.797146  23407.41667
Blenders                        52424  0.008333  750.880591  23077.79167
If I try to get quantiles of the failure times for the equipment, using
print(df2.groupby("CC_CauseLocationEquipmentType").quantile([.1, .25, .5, .75, 0.9, 0.95, 0.99, 0.99999]))
then I see the following quantiles.
                                            CC_TBF
CC_CauseLocationEquipmentType
Coffee Machines                0.10000    0.005556
                               0.25000    0.238889
                               0.50000    1.775000
                               0.75000    2.595833
                               0.90000    4.611389
                               0.95000    7.008125
                               0.99000   15.465278
                               0.99999   21.089619  <-- Nowhere near the max of 23407.41667
Blenders                       0.10000   57.731806
                               0.25000  394.004375
                               0.50000    0.288889
                               0.75000    7.201528
                               0.90000   51.015667
                               0.95000   83.949833
                               0.99000  123.148019
                               0.99999  133.708716  <-- Nowhere near the max of 23077.79167
This is clearly incorrect, because the 99.999% quantile is nowhere near the maximum values for those equipment types. Also note the 'reset' partway through the Blenders quantile set. I asked about this previously in Pandas quantiles misbehaving by... getting smaller partway through a range of percentiles? That question is still unresolved, but I think this new one will help solve it as well.
When I consider the quantiles separately, I get more sensible results. Using the following code:
print("Quantiles - type by type - Coffee Machines")
df3=df2.loc[df2['CC_CauseLocationEquipmentType'] == "Coffee Machines"]
print(df3.quantile([.1, .25, .5, .75,0.9,0.95,0.99,0.99999]) )
print("Quantiles - type by type - Blenders")
df3=df2.loc[df2['CC_CauseLocationEquipmentType'] == "Blenders"]
print(df3.quantile([.1, .25, .5, .75,0.9,0.95,0.99,0.99999]) )
I now get:
Quantiles - type by type - Coffee Machines
               CC_TBF
0.10000      0.120556
0.25000      0.608333
0.50000      2.040556
0.75000      3.494514
0.90000     18.047917
0.95000    129.798403
0.99000   1644.764861
0.99999  23003.517729  <-- Pretty darn close to the max of 23407
Quantiles - type by type - Blenders
               CC_TBF
0.10000      0.226111
0.25000      0.941667
0.50000     37.924167
0.75000    388.554444
0.90000   1955.252500
0.95000   4301.835320
0.99000  11362.310594
0.99999  22831.372845  <-- Pretty darn close to the max of 23077
Can anyone tell me why these two different approaches give such different results, and why my use of groupby gives me such a weird, seemingly arbitrary result? Does groupby somehow use a different subset of the full data than what I select in the second, 'explicit' approach?
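For what it's worth, here's a rough sanity check I can run (a sketch only, using the same column names as above) to see whether groupby is even looking at the same rows as my explicit filters:
# Rough sanity check (same column names as above): compare the row counts
# groupby sees against the explicitly filtered frames.
print(df2.groupby("CC_CauseLocationEquipmentType")["CC_TBF"].count())
print(len(df2.loc[df2["CC_CauseLocationEquipmentType"] == "Coffee Machines"]))
print(len(df2.loc[df2["CC_CauseLocationEquipmentType"] == "Blenders"]))
If those counts line up with the descriptive statistics above, the rows themselves aren't the problem.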
Upvotes: 3
Views: 1285
Reputation: 7121
You can't really see quantile at work in either example in the answer from @alexander-pivovarov. There are only zeros and only one element per group, so the result is always zero. Or am I wrong here?
I have pandas 0.25.3 and get useful results for:
import pandas as pd
df = pd.DataFrame(
    {"A": [1., 2., 3., 4., 5., 6.], "B": ["X", "X", "Y", "Y", "Z", "Z"]}
)
result = df.groupby("B").quantile([0.5, 0.9])
print(result)
Output:
         A
B
X 0.5  1.5
  0.9  1.9
Y 0.5  3.5
  0.9  3.9
Z 0.5  5.5
  0.9  5.9
If it works with a single number passed to quantile(), you could hack something like
q = [0.2, 0.5, 0.9]
res = [df.groupby("B").quantile(p).loc['X', 'A'] for p in q]
df_q = pd.DataFrame({'A': res, 'quantiles': q})
print(df_q)
Output:
     A  quantiles
0  1.2        0.2
1  1.5        0.5
2  1.9        0.9
until it is fixed.
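If you need the same hack for every group rather than just 'X', a sketch along the same lines (same toy frame as above, still calling quantile() with one scalar at a time) could be:
q = [0.2, 0.5, 0.9]
# One scalar quantile() call per value; each returns a Series indexed by group.
res = {p: df.groupby("B")["A"].quantile(p) for p in q}
# Assemble into one frame: one row per group, one column per quantile.
df_q = pd.DataFrame(res)
print(df_q)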
Upvotes: 1
Reputation: 4990
Very interesting. There is clearly a bug, at least in my version of Pandas (0.25.1), with df.groupby(...).quantile(<array-like>). That code path is different and seems to be broken even on very simple examples like:
df = pd.DataFrame(
    {"A": [0., 0., 0.], "B": ["X", "Y", "Z"]}
)
result = df.groupby("B").quantile([0.5, 0.9])
It does work on a 2-element version, though:
df = pd.DataFrame(
    {"A": [0., 0.], "B": ["X", "Y"]}
)
result = df.groupby("B").quantile([0.5, 0.9])
I would avoid using groupby with quantile on array-like objects until the code is fixed, even in cases where it works now, since it is likely error-prone.
Blame also shows a lot of fairly fresh updates there (10 and 16 months old) dealing with exactly these pieces of code.
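One possible way to keep the grouped output while avoiding that array-like code path (just a sketch, not a fix for the underlying bug) is to push the list of quantiles inside an apply, so each group goes through the plain Series.quantile path:
# Each group's values go through the ordinary Series.quantile code path,
# not the grouped array-like path that misbehaves.
result = df.groupby("B")["A"].apply(lambda s: s.quantile([0.5, 0.9]))
print(result)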
Upvotes: 2