Pepino Do Mar

Reputation: 21

np.quantile with wrong calculation?

For some inputs, np.quantile seems to return a value that is not the correct quantile. Is this a bug?

import numpy as np

x = np.array([374, 358, 341, 355, 342, 334, 353, 346, 355, 344,
              349, 330, 352, 328, 336, 359, 361, 345, 324, 386,
              334, 370, 349, 327, 342, 354, 361, 354, 377, 324])

q = np.quantile(x, 0.25)

print(q)

print(len(x[x<=q]) / len(x))

print(len(x[x>=q]) / len(x))

Output:

337.25

0.26666666666666666

0.7333333333333333

0.73 means that only 73% of the values are greater than or equal to the computed quantile; by definition it should be >= 75%.

Upvotes: 2

Views: 2440

Answers (3)

Pascal Breil

Reputation: 1

The problem is that you include the value q on both sides of the inequality:

print(len(x[x<=q]) / len(x))
print(len(x[x>=q]) / len(x))

The correct result comes from including q on one side only:

print(len(x[x<=q]) / len(x))
print(len(x[x>q]) / len(x))

Output:

0.26666666666666666
0.7333333333333333

The two fractions sum to 1.0.
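To see how much the choice between >= and > matters for a particular q, one can count the values strictly below, equal to, and strictly above it. This is a minimal sketch in plain Python; the names below, equal and above are illustrative, and q is hard-coded to the value reported in the question.

```python
# Data from the question, hard-coded so the sketch runs on its own.
x = [374, 358, 341, 355, 342, 334, 353, 346, 355, 344,
     349, 330, 352, 328, 336, 359, 361, 345, 324, 386,
     334, 370, 349, 327, 342, 354, 361, 354, 377, 324]
q = 337.25  # value reported by np.quantile(x, 0.25) in the question

below = sum(1 for v in x if v < q)   # strictly below q
equal = sum(1 for v in x if v == q)  # exactly equal to q
above = sum(1 for v in x if v > q)   # strictly above q

print(below, equal, above)
print((below + equal) / len(x))  # fraction with x <= q
print((equal + above) / len(x))  # fraction with x >= q
```

Any value equal to q is counted by both x <= q and x >= q, so the two fractions sum to more than 1 only when equal is nonzero.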

Upvotes: 0

jeannej

Reputation: 1204

As @SamProell stated, there are different conventions for calculating centiles, as you can see here with the methods for computing quartiles (American way). Here we have an even number of data points, so let's stick to the first method and see how we would do it "by hand".

First, sort the data:

> x2=np.sort(x)
> x2
array([324, 324, 327, 328, 330, 334, 334, 336, 341, 342, 342, 344, 345,
       346, 349, 349, 352, 353, 354, 354, 355, 355, 358, 359, 361, 361,
       370, 374, 377, 386])

Then divide the data in two halves:

> x2_low = x2[:int(len(x2)/2)]
array([324, 324, 327, 328, 330, 334, 334, 336, 341, 342, 342, 344, 345,
       346, 349])
> x2_up = x2[int(len(x2)/2):]
array([349, 352, 353, 354, 354, 355, 355, 358, 359, 361, 361, 370, 374,
       377, 386])

Finally, find the median of the lower half (i.e. the value cutting that half in two). Here lies a choice, as len(x2_low)=15. You could say that the median of x2_low is its 8th value (index 7 in Python), then:

> q = x2_low[int(len(x2_low)/2)]
336
> len(x2_low[x2_low<q])
7
> len(x2_low[x2_low>q])
7

This is also what np.median(x2_low) would return, or even q=np.percentile(x2,25,interpolation='lower'). But you would still get:

> len(x[x<q])/len(x)
0.23333333333333334

This is because the number of data points (30) is not a multiple of 4. Now it all depends on what you want to achieve; here are the results you get for each value of the interpolation parameter:
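The point about 30 not being a multiple of 4 can be checked with a little arithmetic: the share of values strictly below any cut point has the form k/30, and hitting 0.25 exactly would need k = 7.5. A small sketch (the variable names are illustrative, not from the answer):

```python
n = 30         # number of observations in the question
target = 0.25  # the quantile level being asked for

print(target * n)  # the count needed for an exact 25% split; not an integer here

# The two nearest achievable counts bracket the target.
k_low = int(target * n)  # rounds 7.5 down to 7
k_high = k_low + 1

print(k_low / n)   # largest achievable share at or under 25%
print(k_high / n)  # smallest achievable share above 25%
```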

linear: the default; this is the result you got in your question (337.25)

lower: see above

higher:

> q=np.percentile(x,25,interpolation='higher')
341
> len(x[x>q])/len(x)
0.7
> len(x[x<q])/len(x)
0.26666666666666666

nearest:

> q=np.percentile(x,25,interpolation='nearest')
336
> len(x[x>q])/len(x)
0.7333333333333333
> len(x[x<q])/len(x)
0.23333333333333334

and finally midpoint:

> q=np.percentile(x,25,interpolation='midpoint')
338.5
> len(x[x>q])/len(x)
0.7333333333333333
> len(x[x<q])/len(x)
0.26666666666666666

It all depends on what you want to do with this afterwards. For more information on the different calculation methods, check NumPy's documentation.

Upvotes: 1

https://github.com/numpy/numpy/blob/v1.15.1/numpy/lib/function_base.py#L3543-L3644

The default value is linear:
    interpolation : {'linear', 'lower', 'higher', 'midpoint', 'nearest'}
        This optional parameter specifies the interpolation method to
        use when the desired quantile lies between two data points
        ``i < j``:
            * linear: ``i + (j - i) * fraction``, where ``fraction``
              is the fractional part of the index surrounded by ``i``
              and ``j``.
            * lower: ``i``.
            * higher: ``j``.
            * nearest: ``i`` or ``j``, whichever is nearest.
            * midpoint: ``(i + j) / 2``.
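The five rules quoted above can be written out in plain Python to see where each value comes from. This is only a sketch: quantile_by_rule is a hypothetical helper, not NumPy's implementation. It assumes the fractional index is q * (n - 1) on the sorted data (which reproduces the 337.25 from the question), and its tie-breaking for 'nearest' at a fraction of exactly 0.5 is a simplification.

```python
import math

# Data from the question, hard-coded so the sketch runs on its own.
x = [374, 358, 341, 355, 342, 334, 353, 346, 355, 344,
     349, 330, 352, 328, 336, 359, 361, 345, 324, 386,
     334, 370, 349, 327, 342, 354, 361, 354, 377, 324]

def quantile_by_rule(data, q, rule="linear"):
    """Hypothetical helper applying one of the five quoted rules."""
    s = sorted(data)
    pos = q * (len(s) - 1)        # assumed fractional index into the sorted data
    lo, hi = math.floor(pos), math.ceil(pos)
    i, j = s[lo], s[hi]           # the two neighbouring data points, i <= j
    fraction = pos - lo           # fractional part of the index
    if rule == "linear":
        return i + (j - i) * fraction
    if rule == "lower":
        return i
    if rule == "higher":
        return j
    if rule == "midpoint":
        return (i + j) / 2
    if rule == "nearest":
        # Simplified tie-breaking: a fraction of exactly 0.5 goes to j.
        return i if fraction < 0.5 else j
    raise ValueError(f"unknown rule: {rule}")

for rule in ("linear", "lower", "higher", "midpoint", "nearest"):
    print(rule, quantile_by_rule(x, 0.25, rule))
```

For q = 0.25 and 30 values the index is 7.25, so i and j are the 8th and 9th sorted values (336 and 341).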

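Note that 'higher' returns 341 here, which still leaves only 22 of the 30 values (73.3%) greater than or equal to q. If you select 'lower' you get 336, and 23 of the 30 values (76.7%) are greater than or equal to it, which is what you want.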
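The criterion from the question (at least 25% of the values at or below q, and at least 75% at or above it) can be tested directly for each candidate value. A minimal sketch in plain Python; meets_definition is a hypothetical helper, and the candidate values are the ones reported in this thread for each interpolation option.

```python
# Data from the question, hard-coded so the sketch runs on its own.
x = [374, 358, 341, 355, 342, 334, 353, 346, 355, 344,
     349, 330, 352, 328, 336, 359, 361, 345, 324, 386,
     334, 370, 349, 327, 342, 354, 361, 354, 377, 324]

def meets_definition(data, q, p):
    """True if at least p of the data is <= q and at least 1 - p is >= q."""
    n = len(data)
    at_or_below = sum(1 for v in data if v <= q) / n
    at_or_above = sum(1 for v in data if v >= q) / n
    return at_or_below >= p and at_or_above >= 1 - p

# Values of q reported in the thread for each interpolation option.
candidates = {"lower": 336, "linear": 337.25, "midpoint": 338.5, "higher": 341}

for name, q in candidates.items():
    print(name, q, meets_definition(x, q, 0.25))
```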

Upvotes: 1
