Reputation: 391
The linear interpolation formula for percentiles is:
linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
Suppose I have this list with 16 observations:
test = [0, 1, 5, 5, 5, 6, 6, 7, 7, 8, 11, 12, 21, 23, 23, 24]
I pass it as a numpy array and calculate the 85th percentile using linear interpolation.
import numpy as np
np_test = np.asarray(test)
np.percentile(np_test, 85, interpolation='linear')
The result I get is 22.5. However, I don't think that's correct. The index of the 85th percentile is .85 * 16 = 13.6. Thus, the fractional part is .6. The 13th value is 21, so i = 21. The 14th value is 23, so j = 23. The linear formula should then yield:
21 + (23 - 21) * .6 = 21 + 2 * .6 = 21 + 1.2 = 22.2
The correct answer is 22.2. Why am I getting 22.5 instead?
Upvotes: 7
Views: 8060
Reputation: 8378
len(test) is 16, but the distance between the last element and the first element is one less: d = 16 - 1 = 15 (equivalently, the 0-based indices run from 0 to 15, so d = 15 - 0 = 15). Therefore, the index of the 85th percentile is d * 0.85 = 15 * 0.85 = 12.75. Since test[12] = 21 and test[13] = 23, linear interpolation over the fractional part gives 21 + 0.75 * (23 - 21) = 22.5. The correct answer is 22.5.
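To check the arithmetic above by hand, here is a minimal sketch of the same calculation. The helper name percentile_linear is made up for illustration and is not part of NumPy; it assumes the (N - 1) * q / 100 position rule described above and 0-based indexing.

```python
import math
import numpy as np

test = [0, 1, 5, 5, 5, 6, 6, 7, 7, 8, 11, 12, 21, 23, 23, 24]

def percentile_linear(values, q):
    """Hypothetical helper: percentile by linear interpolation over 0-based positions."""
    data = sorted(values)
    # Position along the index axis: q percent of the way from index 0 to index N - 1.
    pos = (len(data) - 1) * q / 100.0
    lower = math.floor(pos)                 # index i to the left of pos
    upper = min(lower + 1, len(data) - 1)   # index j to the right, clamped at the end
    fraction = pos - lower                  # fractional part of the position
    return data[lower] + (data[upper] - data[lower]) * fraction

manual = percentile_linear(test, 85)
library = float(np.percentile(np.asarray(test), 85))
print(manual, library)
```

Both values should agree, since the position 12.75 falls between test[12] and test[13].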
From the Notes section of the documentation of numpy.percentile():
Given a vector V of length N, the q-th percentile of V is the value q/100 of the way from the minimum to the maximum in a sorted copy of V.
The key here is, in my opinion, "the way from the minimum to the maximum". Let's say we number the elements from 1 to 16. Then the "position" of the first element is 1, and the "position" (along the "coordinate axis of indices") of the last element in test is 16. Therefore, the distance between them is 16 - 1 = 15.
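One way to see the "q/100 of the way from the minimum to the maximum" reading is to interpolate directly along the axis of positions. The sketch below uses np.interp with 0-based positions 0..15 (the distance is still 15, same as numbering from 1 to 16); the function name percentile_via_interp is only illustrative.

```python
import numpy as np

test = np.array([0, 1, 5, 5, 5, 6, 6, 7, 7, 8, 11, 12, 21, 23, 23, 24])

sorted_v = np.sort(test)
positions = np.arange(len(sorted_v))  # 0, 1, ..., 15: one position per sorted element

def percentile_via_interp(q):
    """Illustrative helper: walk q percent of the way from the first to the last position."""
    target = q * positions[-1] / 100  # e.g. 85 * 15 / 100 = 12.75
    return float(np.interp(target, positions, sorted_v))

# q = 0 lands on the minimum, q = 100 on the maximum, q = 85 in between.
results = {q: percentile_via_interp(q) for q in (0, 85, 100)}
print(results)
print(float(np.percentile(test, 85)))
```

With this view, the 0th percentile is the minimum and the 100th is the maximum, which would not hold if the index were computed as q/100 * N.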
Upvotes: 11