Magnus

Reputation: 760

Are outliers used to calculate quantiles in box plots in ggplot2?

As stated in the title. I've browsed through a couple of articles, and they're quite vague on this subject. Are all values used when calculating the quantiles in a box plot (Q1, Q2, Q3), or only the ones in the "data range" (that is to say, the ones within 1.5 times the interquartile range of Q1 or Q3)?

I'm creating my boxplots using the ggplot2 package. I write:

fulldata %>%
  filter(status=="påbörjat studier") %>%
  ggplot(aes(x=fct_reorder(urvalsgrupp, PERC_CREDIT, .fun = median), y=PERC_CREDIT)) +
  geom_boxplot() +
  coord_flip()

And I get the following plot (image not available):

Now, as you can see, there are two outliers in the HP group. Were these outliers used when calculating the quantiles, or would the box be placed further to the left if these values had been taken into account?

Upvotes: 0

Views: 353

Answers (1)

Magnus

Reputation: 760

I can't find a straight answer in the documentation, but we can study this empirically. First we create a subset of the data consisting of the HP group filtered the same way as the dplyr chain above:

dftest <- fulldata %>%
  filter(urvalsgrupp == "HP" & status == "påbörjat studier")

Then we can calculate the quantiles manually:

quantile(dftest$PERC_CREDIT, probs = c(0.25, 0.50, 0.75))

Output:

25%       50%       75% 
0.4277778 0.6000000 0.6500000 

These values roughly match the box drawn for the HP group in the first boxplot. We can't draw a definite conclusion from this alone (several observations may share exactly the same PERC_CREDIT, in which case dropping the outliers might not move the quartiles much), but the result points towards all values, outliers included, being used to calculate the quantiles.
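To make the comparison less dependent on reading values off the plot, the same check can be run on a small made-up data set where removing the outliers clearly shifts the quartiles. The sketch below is only an illustration: the `toy` data frame and its values are hypothetical (not taken from `fulldata`), and it assumes ggplot2's `layer_data()` helper is used to read back the statistics that `geom_boxplot()` computed for the box.

```r
library(ggplot2)

# Hypothetical toy data: eight ordinary values plus two extreme low values.
toy <- data.frame(
  group = "HP",
  value = c(0.01, 0.02, 0.55, 0.58, 0.60, 0.62, 0.64, 0.65, 0.66, 0.70)
)
probs <- c(0.25, 0.50, 0.75)

# Quartiles from all values (quantile() with its default algorithm).
q_all <- quantile(toy$value, probs = probs, names = FALSE)

# Flag outliers with the 1.5 * IQR rule, then recompute without them.
iqr <- q_all[3] - q_all[1]
is_outlier <- toy$value < q_all[1] - 1.5 * iqr |
  toy$value > q_all[3] + 1.5 * iqr
q_trimmed <- quantile(toy$value[!is_outlier], probs = probs, names = FALSE)

# Statistics that ggplot2 computed for the box itself.
p <- ggplot(toy, aes(x = group, y = value)) + geom_boxplot()
box <- layer_data(p)
q_plot <- c(box$lower, box$middle, box$upper)

# Side-by-side comparison of the three sets of quartiles.
comparison <- rbind(all_values = q_all,
                    without_outliers = q_trimmed,
                    ggplot2_box = q_plot)
colnames(comparison) <- c("Q1", "Q2", "Q3")
print(comparison)
```

If the `ggplot2_box` row agrees with `all_values` rather than `without_outliers`, that supports the conclusion that the outliers are included when the box is computed. The same `layer_data()` call could be applied to the real plot to compare against the `quantile()` output for `dftest`.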

Upvotes: 1
