Reputation: 21
In some cases, np.quantile seems to return a value that is not the correct quantile. Is this a bug?
import numpy as np
x = np.array([374, 358, 341, 355, 342, 334, 353, 346, 355, 344,
349, 330, 352, 328, 336, 359, 361, 345, 324, 386,
334, 370, 349, 327, 342, 354, 361, 354, 377, 324])
q = np.quantile(x, 0.25)
print(q)
print(len(x[x<=q]) / len(x))
print(len(x[x>=q]) / len(x))
Output:
337.25
0.26666666666666666
0.7333333333333333
0.73 means that only 73% of the values are greater than or equal to the computed quantile; by definition it should be >= 75%.
Upvotes: 2
Views: 2440
Reputation: 1
The problem is that you include the q value on both sides of the inequality.
print(len(x[x<=q]) / len(x))
print(len(x[x>=q]) / len(x))
Using a strict inequality on one side gives proportions that sum to 1:
print(len(x[x<=q]) / len(x))
print(len(x[x>q]) / len(x))
0.26666666666666666
0.7333333333333333
sum=1.0
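To see how much the choice between >= and > matters for this data, one can count the values below, equal to, and above q separately. This is only a minimal sketch (the names below, equal and above are illustrative). Since 337.25 is an interpolated value that does not occur in x, the "equal" share is zero here, so both comparisons give the same proportions.

```python
import numpy as np

x = np.array([374, 358, 341, 355, 342, 334, 353, 346, 355, 344,
              349, 330, 352, 328, 336, 359, 361, 345, 324, 386,
              334, 370, 349, 327, 342, 354, 361, 354, 377, 324])

q = np.quantile(x, 0.25)

# Fraction of values strictly below, exactly equal to, and strictly above q.
below = float(np.mean(x < q))
equal = float(np.mean(x == q))
above = float(np.mean(x > q))

print(q, below, equal, above)
# The three shares always sum to 1; >= and > differ only by the "equal" share.
print(below + equal + above)
```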
Upvotes: 0
Reputation: 1204
As @SamProell stated, there are different conventions for calculating percentiles, as you can see here with the methods for computing quartiles (the American way). Here we have an even number of data points, so let's stick to the first method and see how we would do it "by hand".
First, sort the data:
> x2=np.sort(x)
> print(x2)
array([324, 324, 327, 328, 330, 334, 334, 336, 341, 342, 342, 344, 345,
346, 349, 349, 352, 353, 354, 354, 355, 355, 358, 359, 361, 361,
370, 374, 377, 386])
Then divide the data in two halves:
> x2_low = x2[:int(len(x2)/2)]
array([324, 324, 327, 328, 330, 334, 334, 336, 341, 342, 342, 344, 345,
346, 349])
> x2_up = x2[int(len(x2)/2):]
array([349, 352, 353, 354, 354, 355, 355, 358, 359, 361, 361, 370, 374,
377, 386])
Finally, find the median (i.e. the value cutting your data in half). Here lies a choice, as len(x2_low)=15. You could say that the median of x2_low is its 8th value (index 7 in Python), then:
> q = x2_low[int(len(x2_low)/2)]
336
> len(x2_low[x2_low<q])
7
> len(x2_low[x2_low>q])
7
This is also what np.median(x2_low) would return, or even q=np.percentile(x2,25,interpolation='lower'). But you would still get:
> len(x[x<q])/len(x)
0.23333333333333334
This is because the number of data points is not a multiple of 4. Now it all depends on what you want to achieve; here are the results you get for each value of the interpolation parameter:
linear: the default one, which gives the result in your question
lower: see above
higher:
> q=np.percentile(x,25,interpolation='higher')
341
> len(x[x>q])/len(x)
0.7
> len(x[x<q])/len(x)
0.26666666666666666
nearest:
> q=np.percentile(x,25,interpolation='nearest')
336
> len(x[x>q])/len(x)
0.7333333333333333
> len(x[x<q])/len(x)
0.23333333333333334
and finally midpoint:
> q=np.percentile(x,25,interpolation='midpoint')
338.5
> len(x[x>q])/len(x)
0.7333333333333333
> len(x[x<q])/len(x)
0.26666666666666666
It all depends on what you want to do with this afterwards. For more information on the different calculation methods, see NumPy's documentation.
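The comparison above can be gathered into one loop. The following is a sketch, not a definitive recipe: percentile_with is a hypothetical helper introduced here because newer NumPy releases (1.22 and later) name the keyword method, while older ones call it interpolation. It checks both conditions from the question (share of values <= q and share of values >= q) for each option.

```python
import numpy as np

x = np.array([374, 358, 341, 355, 342, 334, 353, 346, 355, 344,
              349, 330, 352, 328, 336, 359, 361, 345, 324, 386,
              334, 370, 349, 327, 342, 354, 361, 354, 377, 324])


def percentile_with(data, p, how):
    """Hypothetical helper: pass `how` under whichever keyword this NumPy accepts."""
    try:
        return np.percentile(data, p, method=how)          # NumPy >= 1.22
    except TypeError:
        return np.percentile(data, p, interpolation=how)   # older NumPy


results = {}
for how in ("linear", "lower", "higher", "nearest", "midpoint"):
    q = float(percentile_with(x, 25, how))
    at_or_below = float(np.mean(x <= q))   # share of values <= q
    at_or_above = float(np.mean(x >= q))   # share of values >= q
    results[how] = (q, at_or_below, at_or_above)
    print(f"{how:>8}: q={q:7.2f}  <=q: {at_or_below:.3f}  >=q: {at_or_above:.3f}")
```

On this particular data, 'lower' and 'nearest' both return 336, and for that value at least 25% of the points are <= q and at least 75% are >= q; the other three options leave the >= q share at about 73%.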
Upvotes: 1
Reputation: 86
https://github.com/numpy/numpy/blob/v1.15.1/numpy/lib/function_base.py#L3543-L3644
The default value is 'linear':
interpolation : {'linear', 'lower', 'higher', 'midpoint', 'nearest'}
    This optional parameter specifies the interpolation method to
    use when the desired quantile lies between two data points
    ``i < j``:
        * linear: ``i + (j - i) * fraction``, where ``fraction``
          is the fractional part of the index surrounded by ``i``
          and ``j``.
        * lower: ``i``.
        * higher: ``j``.
        * nearest: ``i`` or ``j``, whichever is nearest.
        * midpoint: ``(i + j) / 2``.
If you select 'higher', you get what you want.
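As a worked example of the 'linear' formula quoted above, the 337.25 from the question can be reproduced by hand. This sketch assumes the fractional index is computed as q * (n - 1) on the sorted data, which is my understanding of the default; the names pos, lo and frac are illustrative.

```python
import numpy as np

x = np.array([374, 358, 341, 355, 342, 334, 353, 346, 355, 344,
              349, 330, 352, 328, 336, 359, 361, 345, 324, 386,
              334, 370, 349, 327, 342, 354, 361, 354, 377, 324])

xs = np.sort(x)

# Fractional position of the 0.25 quantile in the sorted array (assumed: q * (n - 1)).
pos = 0.25 * (len(xs) - 1)      # 0.25 * 29 = 7.25
lo = int(np.floor(pos))         # index of the lower neighbour, 7
frac = pos - lo                 # fractional part, 0.25

i, j = xs[lo], xs[lo + 1]       # the two surrounding data points, 336 and 341
manual = i + (j - i) * frac     # linear: i + (j - i) * fraction

print(manual)                   # 337.25
print(np.quantile(x, 0.25))     # same value from NumPy's default
```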
Upvotes: 1