Steve McGahey

Reputation: 33

Pandas quantiles: different answers depending on how we ask

New take on a problem from two days ago.

When you have a big set of data, pandas lets you filter things in place while generating statistics, so you don't have to make a new dataframe for everything. I suspect there's a nuance to doing this, because doing things two different ways - one the 'smart' way and one the 'explicit' way - gives me very different answers, and one is clearly right (the explicit way) while the other is clearly wrong (the smart way).

Can someone point me in the direction of what I'm missing?

Here are the details.

I've got 2.5 million rows of data describing the time to failure for a range of different equipment types. Descriptive statistics for two of these types are below. (These are actually pieces of mining equipment, but I have to sanitise something, so I've replaced the names.)

                                   Count        Min        Mean          Max
CC_CauseLocationEquipmentType                                           
Coffee Machines                    204136  0.000556   71.797146  23407.41667
Blenders                            52424  0.008333  750.880591  23077.79167
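(For reference, a summary table like this can be produced with a one-liner along these lines - a sketch, assuming the raw data is in df2 with a CC_TBF column, as in the code further down:)

# Sketch: reproduce the summary table above, assuming df2 holds the
# raw data with a CC_TBF column of times to failure.
print(df2.groupby("CC_CauseLocationEquipmentType")["CC_TBF"]
         .agg(["count", "min", "mean", "max"]))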

If I try to get quantiles of the failure times for the equipment, using

print(df2.groupby("CC_CauseLocationEquipmentType").quantile([0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99, 0.99999]))

then I see the following quantiles.

                                            CC_TBF
CC_CauseLocationEquipmentType                     
Coffee Machines               0.10000     0.005556
                              0.25000     0.238889
                              0.50000     1.775000
                              0.75000     2.595833
                              0.90000     4.611389
                              0.95000     7.008125
                              0.99000    15.465278
                              0.99999    21.089619    <-- Nowhere near the max of 23407.41667
Blenders                      0.10000    57.731806
                              0.25000   394.004375
                              0.50000     0.288889
                              0.75000     7.201528
                              0.90000    51.015667
                              0.95000    83.949833
                              0.99000   123.148019
                              0.99999   133.708716    <-- Nowhere near the max of 23077.79167

This is clearly incorrect, because the 99.999% quantile is nowhere near the maximum values for those equipment types. Also note the 'reset' partway through the Blenders quantiles. I asked about this previously in Pandas quantiles misbehaving by... getting smaller partway through a range of percentiles? That's still unresolved, but I think this new question will help solve that problem as well.

When I compute the quantiles for each equipment type separately, I get much more sensible results. Using the following code:

print("Quantiles - type by type - Coffee Machines")
df3=df2.loc[df2['CC_CauseLocationEquipmentType'] == "Coffee Machines"]
print(df3.quantile([.1, .25, .5, .75,0.9,0.95,0.99,0.99999]) )

print("Quantiles - type by type - Blenders")
df3=df2.loc[df2['CC_CauseLocationEquipmentType'] == "Blenders"]
print(df3.quantile([.1, .25, .5, .75,0.9,0.95,0.99,0.99999]) )

I now get:

Quantiles - type by type - Coffee Machines
               CC_TBF
0.10000      0.120556
0.25000      0.608333
0.50000      2.040556
0.75000      3.494514
0.90000     18.047917
0.95000    129.798403
0.99000   1644.764861
0.99999  23003.517729   <-- Pretty darn close to the max of 23407
Quantiles - type by type - Blenders
               CC_TBF
0.10000      0.226111
0.25000      0.941667
0.50000     37.924167
0.75000    388.554444
0.90000   1955.252500
0.95000   4301.835320
0.99000  11362.310594
0.99999  22831.372845   <--- pretty darn close to the max of 23077

Can anyone tell me why these two approaches give such different results, and why my use of groupby gives such a weird, seemingly arbitrary result? Does groupby somehow use a different subset of the full data than the second, 'explicit' approach?

Upvotes: 3

Views: 1285

Answers (2)

Joe

Reputation: 7121

In both examples in the answer from @alexander-pivovarov you can't actually see quantile at work: there are only zeros and only one element per group, so the result is always zero. Or am I wrong here?

I have pandas 0.25.3 and get useful results for

import pandas as pd

df = pd.DataFrame(
    {"A": [1., 2., 3., 4., 5., 6.], "B": ["X", "X", "Y", "Y", "Z", "Z"]}
)
result = df.groupby("B").quantile([0.5, 0.9])
print(result)

Output:

        A
B         
X 0.5  1.5
  0.9  1.9
Y 0.5  3.5
  0.9  3.9
Z 0.5  5.5
  0.9  5.9

If it works with a single number passed to quantile(), you could hack something like the following until the bug is fixed:

q = [0.2, 0.5, 0.9]
# One scalar quantile() call per value, then reassemble the results.
res = [df.groupby("B").quantile(p).loc["X", "A"] for p in q]

df_q = pd.DataFrame({"A": res, "quantiles": q})

print(df_q)

Output:

     A  quantiles
0  1.2        0.2
1  1.5        0.5
2  1.9        0.9

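Another workaround along the same lines (a sketch on the toy dataframe defined above; it relies only on Series.quantile accepting a list, which does work): push the list of quantiles through apply on each group, sidestepping the broken array-like codepath entirely.

# Series.quantile() handles a list correctly, so let each group
# compute its own quantiles via apply; the result is a Series with
# a (group, quantile) MultiIndex, matching the groupby output above.
result = df.groupby("B")["A"].apply(lambda s: s.quantile([0.5, 0.9]))
print(result)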

Upvotes: 1

Alexander Pivovarov

Reputation: 4990

Very interesting. There is clearly a bug, at least in my version of pandas (0.25.1), in df.groupby(...).quantile(<array-like>). That codepath is different and seems to be broken even on very simple examples like:

import pandas as pd

df = pd.DataFrame(
    {"A": [0., 0., 0.], "B": ["X", "Y", "Z"]}
)
result = df.groupby("B").quantile([0.5, 0.9])

While it would work on a two-element version:

df = pd.DataFrame(
    {"A": [0., 0.], "B": ["X", "Y"]}
)
result = df.groupby("B").quantile([0.5, 0.9])

I would avoid using groupby with quantile on array-like arguments until the code is fixed, even in cases where it works now, since it is likely error-prone.

git blame also shows a lot of fairly fresh changes there (10 and 16 months old) dealing with exactly these pieces of code.
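In the meantime, a minimal sketch of the safer pattern: call quantile() once per scalar value and stitch the pieces back together with pd.concat (the keys argument labels each piece with its quantile; the dataframe is just the toy example from the other answer):

import pandas as pd

df = pd.DataFrame(
    {"A": [1., 2., 3., 4., 5., 6.], "B": ["X", "X", "Y", "Y", "Z", "Z"]}
)

# One scalar quantile() call per value avoids the broken array-like
# codepath; concat with keys= rebuilds a (quantile, group) index.
qs = [0.5, 0.9]
result = pd.concat([df.groupby("B").quantile(q) for q in qs], keys=qs)
print(result)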

Upvotes: 2
