Reputation: 760
As stated in the title. I've browsed through a couple of articles and they're quite vague on this subject. Are all values used when computing the quartiles in a box plot (Q1, Q2, Q3), or only the ones in the "data range" (that is, the ones within 1.5 times the interquartile range of Q1 or Q3)?
I'm creating my boxplots using the ggplot2 package. I write:
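library(dplyr)    # filter(), %>%
library(forcats)  # fct_reorder()
library(ggplot2)

fulldata %>%
  filter(status == "påbörjat studier") %>%  # status value means "started studies"
  ggplot(aes(x = fct_reorder(urvalsgrupp, PERC_CREDIT, .fun = median), y = PERC_CREDIT)) +
  geom_boxplot() +
  coord_flip()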
Now as you can see there are two outliers in the HP group. Were these outliers used when calculating the quantiles, or should the box/quantiles (if these values were taken into account) be placed further to the left?
Upvotes: 0
Views: 353
Reputation: 760
I can't find a straight answer in the documentation, but we can study this empirically. First we create a subset of the data consisting of the HP group filtered the same way as the dplyr chain above:
dftest <- fulldata %>% filter(urvalsgrupp == "HP" & status == "påbörjat studier")
Then we can calculate the quantiles manually:
quantile(dftest$PERC_CREDIT, probs = c(0.25, 0.50, 0.75))
Output:
      25%       50%       75%
0.4277778 0.6000000 0.6500000
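These values roughly match the box drawn for the HP group in the first boxplot. We can't draw a definite conclusion from this alone (several observations can share exactly the same PERC_CREDIT, so dropping a few might leave the quartiles unchanged), but the result points towards all values, outliers included, being used to calculate the quartiles. That is also what the outlier rule implies: a point counts as an outlier only relative to Q1, Q3 and the IQR, so the quartiles have to be computed from the full data first.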
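To make the comparison less dependent on reading values off the plot, here is a minimal sketch on made-up data. The `toy` data frame and its values are hypothetical and not taken from `fulldata`; the 0.2 cutoff used to drop the two low values is likewise only for illustration. It pulls the box statistics that ggplot2 computed out of the plot with `layer_data()` and sets them next to the quartiles calculated with and without the outliers:

```r
library(ggplot2)

# Hypothetical data: a tight cluster plus two low values that lie well
# below Q1 - 1.5 * IQR, so geom_boxplot() draws them as outlier points.
toy <- data.frame(
  group = "HP",
  value = c(0.05, 0.10, 0.55, 0.58, 0.60, 0.60,
            0.62, 0.63, 0.65, 0.65, 0.66, 0.68)
)

p <- ggplot(toy, aes(x = group, y = value)) + geom_boxplot()

# Box statistics as computed by ggplot2 for this layer
box <- layer_data(p)
from_plot <- c(box$lower, box$middle, box$upper)

# Quartiles from all values vs. quartiles after dropping the two low values
probs <- c(0.25, 0.50, 0.75)
q_all <- unname(quantile(toy$value, probs = probs))
q_trimmed <- unname(quantile(toy$value[toy$value > 0.2], probs = probs))

# One row per method, one column per quartile
print(rbind(from_plot, q_all, q_trimmed))
```

If the `from_plot` row matches `q_all` rather than `q_trimmed`, the box was built from every observation in the group. The same check should work on the real data by replacing `toy` with `dftest` and `value` with `PERC_CREDIT`.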
Upvotes: 1