Reputation: 93
So, assume we have an array like this: 4, 6, 9, 10, 2, 25, 12, 6, 9. I then try to calculate the quartiles with numpy.quantile and statistics.quantiles:
import numpy as np
from statistics import quantiles
arr = np.array([4,6,9,10,2,25,12,6,9,])
np.quantile(arr, (0.25, 0.50, 0.75))
quantiles(arr)
With numpy, the result is:
array([ 6., 9., 10.])
With statistics, the result is:
[5.0, 9.0, 11.0]
So which library is calculating correctly?
Upvotes: 3
Views: 4718
Reputation: 4045
In fact, MATLAB even returns a 3rd option: [5.5 9.0 10.5]. Your question is reasonable: how can that be?
Let us first recall the definition of quantiles:
"In statistics and probability, quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities (...). q-quantiles are values that partition a finite set of values into q subsets of (nearly) equal sizes." (From the introduction of the Wikipedia article.)
The problem is odd-sized populations/groups (Wikipedia has an example). You have to choose what to do with the fractions: the question is whether to include the point of division or not.
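Apparently, numpy chose to include the end points: by default it treats the smallest value as the 0% point and the largest as the 100% point and interpolates linearly in between, which for these 9 values lands exactly on data points. statistics, by default, does not include the boundary points, so its cut points fall between neighbouring values and the outer quartiles are pushed outwards. MATLAB uses yet another interpolation rule and simply reports the resulting boundary, which is not necessarily a member of the set.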
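To make the difference concrete, here is a minimal pure-Python sketch that reproduces all three results from the question by changing only the position at which the sorted data is interpolated. The helper name `interpolated_quantile` and the rule labels are made up for illustration, and the "midpoint" rule is my assumption about how MATLAB arrives at the values quoted above; it is not taken from any of the libraries' source code.

```python
import math

def interpolated_quantile(sorted_data, pos):
    """Linearly interpolate at a 0-based fractional index, clamped to the data range."""
    pos = min(max(pos, 0.0), len(sorted_data) - 1.0)
    lo = math.floor(pos)
    hi = min(lo + 1, len(sorted_data) - 1)
    frac = pos - lo
    return sorted_data[lo] * (1 - frac) + sorted_data[hi] * frac

data = sorted([4, 6, 9, 10, 2, 25, 12, 6, 9])
n = len(data)
probs = (0.25, 0.50, 0.75)

# 0-based fractional index of the cut point under each convention.
# The labels are illustrative; the MATLAB-like rule is an assumption.
rules = {
    "inclusive (numpy default)": lambda p: (n - 1) * p,
    "exclusive (statistics default)": lambda p: (n + 1) * p - 1,
    "midpoint (MATLAB-like)": lambda p: n * p - 0.5,
}

results = {
    name: [interpolated_quantile(data, rule(p)) for p in probs]
    for name, rule in rules.items()
}
for name, values in results.items():
    print(name, values)
```

All three rules agree on the median here and only disagree on where the outer cut points sit relative to the neighbouring data values.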
Now to the real question: what is correct? All of them. The difference becomes negligible for larger groups/populations, as are typical in statistics ;)
Upvotes: 2
Reputation: 2119
The default method of the built-in statistics.quantiles is "exclusive", whereas numpy.quantile is inclusive. If you write
quantiles(arr, method='inclusive')
you get the same answer as numpy. You should read the docs to find out which one suits your needs.
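As a quick check, this small sketch runs both methods of statistics.quantiles on the data from the question. It uses a plain list instead of a numpy array so that it only depends on the standard library (Python 3.8 or newer); the variable names are just for illustration.

```python
from statistics import quantiles

arr = [4, 6, 9, 10, 2, 25, 12, 6, 9]

# n=4 asks for quartiles, which is also the default
exclusive = quantiles(arr, n=4)                      # default: method='exclusive'
inclusive = quantiles(arr, n=4, method='inclusive')  # treats min and max as the 0% and 100% points

print(exclusive)  # [5.0, 9.0, 11.0]
print(inclusive)  # [6.0, 9.0, 10.0]
```

Going the other way should also be possible: as far as I know, newer NumPy versions (1.22 and later) accept a method argument on np.quantile, and method='weibull' should correspond to the exclusive rule, but check the NumPy docs for your version.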
Upvotes: 4