Reputation: 4248
I'm working with this WNBA dataset. I'm analyzing the Height variable, and below is a table showing frequency, cumulative percentage, and cumulative frequency for each height value recorded:
From the table I can easily conclude that the first quartile (the 25th percentile) cannot be larger than 175.
However, when I use Series.describe(), I'm told that the 25th percentile is 176.5. Why is that so?
wnba.Height.describe()
count 143.000000
mean 184.566434
std 8.685068
min 165.000000
25% 176.500000
50% 185.000000
75% 191.000000
max 206.000000
Name: Height, dtype: float64
Upvotes: 5
Views: 2023
Reputation: 1094
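There are various ways to estimate quantiles.
The 175.0 vs 176.5 difference comes from two different estimation methods.
They differ only in how the fractional rank h is computed, where x is the sorted data indexed from 1, N is the number of observations, and p is the quantile: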
#1
h = (N − 1)*p + 1  # p being 0.25 in your case
Est_Quantile = x[⌊h⌋] + (h − ⌊h⌋)*(x[⌊h⌋ + 1] − x[⌊h⌋])
#2
h = (N + 1)*p
Est_Quantile = x[⌊h⌋] + (h − ⌊h⌋)*(x[⌊h⌋ + 1] − x[⌊h⌋])
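To make the difference concrete, here is a minimal pure-Python sketch of both estimators. The heights list is made up for illustration (it is not the WNBA data), and the function names are my own. As a sanity check, the results are compared with the standard library's statistics.quantiles, whose 'inclusive' and 'exclusive' methods use the same two rank formulas.

```python
import math
import statistics


def quantile_by_rank(sorted_values, h):
    """Interpolate at the 1-based fractional rank h in already-sorted data."""
    n = len(sorted_values)
    h = min(max(h, 1), n)           # clamp the rank to the valid range [1, n]
    lo = math.floor(h)
    hi = min(lo + 1, n)             # avoid stepping past the last element
    x_lo = sorted_values[lo - 1]    # convert the 1-based rank to a 0-based index
    x_hi = sorted_values[hi - 1]
    return x_lo + (h - lo) * (x_hi - x_lo)


def quantile_method_1(values, p):
    """Method #1: h = (N - 1) * p + 1."""
    xs = sorted(values)
    return quantile_by_rank(xs, (len(xs) - 1) * p + 1)


def quantile_method_2(values, p):
    """Method #2: h = (N + 1) * p."""
    xs = sorted(values)
    return quantile_by_rank(xs, (len(xs) + 1) * p)


# Hypothetical sample, chosen only to illustrate the two formulas.
heights = [165, 170, 175, 178, 180, 183, 185, 188, 191, 193, 196]

q1_method_1 = quantile_method_1(heights, 0.25)
q1_method_2 = quantile_method_2(heights, 0.25)

# Cross-check against the standard library (first of the three quartile cut points).
q1_inclusive = statistics.quantiles(heights, n=4, method="inclusive")[0]
q1_exclusive = statistics.quantiles(heights, n=4, method="exclusive")[0]

print(q1_method_1, q1_method_2)
```

On this sample, method #1 lands halfway between the 3rd and 4th sorted values, while method #2 lands exactly on the 3rd, which is the same kind of gap as in the question.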
Upvotes: 4
Reputation: 1017
That is because by default describe() does a linear interpolation.
So no, pandas is not showing the wrong percentile (it is just not showing the percentile you want to see).
To get what you expect, you can use .quantile() on the Height series, setting interpolation to 'lower':
import pandas as pd
df = pd.read_csv('../input/WNBA Stats.csv')
df.Height.quantile(0.25, interpolation='lower')  # 'lower' picks the lower neighbouring data point instead of interpolating
See documentation for more options.
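As a rough illustration of how the interpolation argument changes the result, here is a small sketch on a made-up series (not the WNBA data, so the numbers only show the mechanism):

```python
import pandas as pd

# Hypothetical heights, used only to illustrate the interpolation options.
heights = pd.Series([165, 170, 175, 178, 180, 183, 185, 188, 191, 193, 196])

# Default: linear interpolation between the two neighbouring data points.
q1_linear = heights.quantile(0.25)

# 'lower' and 'higher' return an actual data point instead of interpolating.
q1_lower = heights.quantile(0.25, interpolation='lower')
q1_higher = heights.quantile(0.25, interpolation='higher')

# describe() reports the same value as the default quantile().
q1_describe = heights.describe()['25%']

print(q1_linear, q1_lower, q1_higher, q1_describe)
```

Here the 25% position falls between the 3rd and 4th sorted values (175 and 178), so the default reports a value between them, while 'lower' reports 175 itself.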
Note that as @jpp said:
There are many definitions of percentile
You can also see this answer, which discusses differences between numpy and pandas percentile calculations, for instance.
Upvotes: 1
Reputation: 164693
This is a statistics problem. There are many definitions of percentile. Here is one explanation why you would add 1 in calculating your 25th percentile index:
One intuitive answer is that the average of the numbers 1 through n is not n/2 but rather (n+1)/2. So this gives you a hint that simply using p*n would produce values that are slightly too small.
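A quick way to see this, as a small sketch with an arbitrary n of my choosing: for the median (p = 0.5), the middle rank of 1 through n is the mean of the ranks, which matches p*(n+1) and not p*n.

```python
import statistics

n = 9                                  # arbitrary sample size for illustration
ranks = list(range(1, n + 1))          # ranks 1, 2, ..., n

mean_rank = statistics.mean(ranks)     # the middle rank, equal to (n + 1) / 2
rank_with_n_plus_1 = 0.5 * (n + 1)     # p * (n + 1) with p = 0.5
rank_with_n = 0.5 * n                  # p * n with p = 0.5

print(mean_rank, rank_with_n_plus_1, rank_with_n)
```

With n = 9, the middle observation is the 5th one, and p*n falls half a rank short of it.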
Upvotes: 1