Reputation: 32721
I calculated the upper quartile (Q3, the 75th percentile) and the lower quartile (Q1, the 25th percentile) using Numpy/Pandas and a TI-nspire, but I get different values. Why does this happen?
Computing by hand, Q1 = (5+8)/2 = 6.5 and Q3 = (18+21)/2 = 19.5, so the Numpy/Pandas Q1 and Q3 look wrong. Why does Numpy/Pandas return wrong numbers?
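import numpy as np
import pandas as pd  # needed for the DataFrame below
data = np.array([2, 4, 5, 8, 10, 11, 12, 14, 17, 18, 21, 22, 25])
# Q3 and Q1 with NumPy's default (linear) interpolation
q75, q25 = np.percentile(data, [75, 25])
print(q75, q25)
df = pd.DataFrame(data)
# the 25% and 75% rows of the summary are Q1 and Q3
print(df.describe())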
Numpy returns 18.0 and 8.0. Pandas returns 18.0 and 8.0. But TI-nspire returns 19.5 and 6.5.
Upvotes: 1
Views: 755
Reputation: 31145
You are in for a treat. They are both right.
Unlike most other descriptive statistics, there are several different definitions of Q1 and Q3 in use. For a dataset with a large number of observations, the different definitions give more or less the same result. For small datasets you will see differences, as you experienced.
MathWorld lists 5 (five!) different ways of computing quartiles. See http://mathworld.wolfram.com/Quartile.html
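To make the "several definitions" point concrete, here is a minimal sketch comparing two of them on the data from the question. It assumes the TI-nspire takes Q1 and Q3 as the medians of the lower and upper halves of the sorted data, leaving out the overall median when the count is odd. That is an assumption that happens to match the 6.5 and 19.5 reported above, not something confirmed here, and `quartiles_median_of_halves` is a hypothetical helper name, not a NumPy or Pandas function.

```python
import numpy as np

def quartiles_median_of_halves(values):
    """Q1 and Q3 as the medians of the lower and upper halves.

    Assumption: with an odd number of observations the overall
    median is excluded from both halves.
    """
    x = np.sort(np.asarray(values, dtype=float))
    half = len(x) // 2
    lower = x[:half]             # values before the median position
    upper = x[len(x) - half:]    # values after the median position
    return float(np.median(lower)), float(np.median(upper))

data = [2, 4, 5, 8, 10, 11, 12, 14, 17, 18, 21, 22, 25]

# Definition 1: median of halves
q1_halves, q3_halves = quartiles_median_of_halves(data)
# Definition 2: NumPy's default linear interpolation
q1_linear, q3_linear = np.percentile(data, [25, 75])

print(q1_halves, q3_halves)   # 6.5 19.5
print(q1_linear, q3_linear)   # 8.0 18.0
```

Both pairs are valid quartiles under their own definition; they just place the 25% and 75% cut points differently on a 13-point sample.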
Upvotes: 1
Reputation: 32721
This post and this post helped me understand it.
So if you have [7, 15, 36, 39, 40, 41], then 7 -> 0%, 15 -> 20%, 36 -> 40%, 39 -> 60%, 40 -> 80%, 41 -> 100%.
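The default interpolation is linear, so it uses i + (j - i) * fraction, where i and j are the two data points surrounding the requested percentile and fraction is how far the percentile lies between them. You can set interpolation to midpoint, which calculates (i + j) / 2.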
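As a rough check of the linear rule, the arithmetic can be done by hand. The sketch below assumes the fractional position is (n - 1) * q / 100 in the sorted data, which is consistent with the 0%, 20%, ..., 100% mapping above; `percentile_linear` is a hypothetical helper written for illustration, not part of NumPy.

```python
import math
import numpy as np

def percentile_linear(sorted_values, q):
    """Percentile q (0-100) by linear interpolation between neighbours."""
    pos = (len(sorted_values) - 1) * q / 100.0   # fractional index
    lo = math.floor(pos)                          # index of i
    hi = min(lo + 1, len(sorted_values) - 1)      # index of j
    fraction = pos - lo
    i, j = sorted_values[lo], sorted_values[hi]
    return i + (j - i) * fraction

data = [7, 15, 36, 39, 40, 41]
for q in (25, 50, 75):
    # hand-computed value next to NumPy's default result
    print(q, percentile_linear(data, q), np.percentile(data, q))
```

For q = 25 the position is 1.25, so i = 15, j = 36 and the result is 15 + (36 - 15) * 0.25 = 20.25.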
import numpy as np
data = np.array([7, 15, 36, 39, 40, 41])
# NumPy 1.22+ renames this keyword to method=; interpolation= is deprecated there
linear = np.percentile(data, [25, 50, 75], interpolation='linear')
mid = np.percentile(data, [25, 50, 75], interpolation='midpoint')
low = np.percentile(data, [25, 50, 75], interpolation='lower')
high = np.percentile(data, [25, 50, 75], interpolation='higher')
nearest = np.percentile(data, [25, 50, 75], interpolation='nearest')
print(linear, mid, low, high, nearest)
# reference values for Q1, median, Q3 to compare against
print(15, 37.5, 40)
Output:
So I found that none of these interpolation options in Pandas/Numpy computes Q1 and Q3 exactly the way TI-nspire does.
Upvotes: 1