Reputation: 359
I'm trying to calculate the quartiles for an array of values in Python using numpy.
import numpy as np
X = [1, 1, 1, 3, 4, 5, 5, 7, 8, 9, 10, 1000]
I would do the following:
quartiles = np.percentile(X, range(0, 100, 25))
quartiles
# array([1. , 2.5 , 5. , 8.25])
But this is incorrect, as the 1st and 3rd quartiles should be 2 and 8.5, respectively.
This can be shown as follows:
# median of the lower half and of the upper half of the sorted data
Q1 = np.median(X[:len(X)//2])
Q3 = np.median(X[len(X)//2:])
Q1, Q3
# (2.0, 8.5)
I can't get my head round what np.percentile is doing to give a different answer. I'd be very grateful for any light shed on this.
Upvotes: 1
Views: 6113
Reputation: 7476
There is no right or wrong here, simply different ways of calculating percentiles. The percentile is a well-defined concept in the continuous case, but less so for discrete samples. The choice of method makes little difference when the number of observations is very large (compared to the number of duplicates), but it can matter for small samples, so you need to figure out case by case what makes the most sense.
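To make the difference concrete, here is a minimal pure-Python sketch of two such conventions: linear interpolation between the two closest ranks (which, as far as I know, is what np.percentile does by default) and the midpoint of those two ranks. The helper names percentile_linear and percentile_midpoint are made up for illustration and are not part of numpy.

```python
import math

def percentile_linear(values, q):
    """q-th percentile using linear interpolation between the closest ranks."""
    data = sorted(values)
    pos = (len(data) - 1) * q / 100.0   # fractional index into the sorted data
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo                     # how far pos lies between lo and hi
    return data[lo] + frac * (data[hi] - data[lo])

def percentile_midpoint(values, q):
    """q-th percentile using the midpoint of the two closest ranks."""
    data = sorted(values)
    pos = (len(data) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return (data[lo] + data[hi]) / 2.0

X = [1, 1, 1, 3, 4, 5, 5, 7, 8, 9, 10, 1000]
linear_quartiles = [percentile_linear(X, q) for q in (0, 25, 50, 75)]
midpoint_quartiles = [percentile_midpoint(X, q) for q in (0, 25, 50, 75)]
print(linear_quartiles)    # [1.0, 2.5, 5.0, 8.25]
print(midpoint_quartiles)  # [1.0, 2.0, 5.0, 8.5]
```

For the 25th percentile the fractional index is 11 * 0.25 = 2.75, which falls between the sorted values 1 and 3: linear interpolation gives 1 + 0.75 * 2 = 2.5, while the midpoint gives 2.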
To obtain your desired output, specify interpolation='midpoint' in the percentile function:
quartiles = np.percentile(X, range(0, 100, 25), interpolation='midpoint')
quartiles # array([ 1. , 2. , 5. , 8.5])
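One caveat for recent installs: from NumPy 1.22 onward the interpolation keyword was renamed to method, and the old name is deprecated. A hedged sketch that tries the newer keyword first and falls back to the older one (assuming an older NumPy rejects the unknown keyword with a TypeError) could look like this:

```python
import numpy as np

X = [1, 1, 1, 3, 4, 5, 5, 7, 8, 9, 10, 1000]
qs = [0, 25, 50, 75]

try:
    # NumPy >= 1.22: the keyword is called `method`
    quartiles = np.percentile(X, qs, method="midpoint")
except TypeError:
    # Older NumPy: the same option is called `interpolation`
    quartiles = np.percentile(X, qs, interpolation="midpoint")

print(quartiles)
```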
I'd suggest having a look at the docs: http://docs.scipy.org/doc/numpy/reference/generated/numpy.percentile.html
Upvotes: 3