Pikada

Reputation: 31

How do I easily find boxplot outliers?

Below is an example using the mtcars dataset. There is one outlier with a value of 33.9, but I want a function that finds all of them for a given column.

library(dplyr)
library(ggplot2)

mtcars %>%
  ggplot(aes(x = "", y = mpg)) +
  geom_boxplot(fill = "#2645df")

I do not know the formula for the boxplot whisker limits, so I read approximate cut-offs off the plot above and hard-coded them:

# both cut-offs are checked in one call, so the second test does not overwrite the first
res = ifelse(mtcars$mpg > 33 | mtcars$mpg < 10, "outlier", "not outlier")

This approach is both inefficient and incorrect: 33 is not the real upper limit for outliers, and 10 is not the real lower limit.

Upvotes: 1

Views: 97

Answers (3)

Axeman

Reputation: 35382

You can use boxplot.stats:

my_outliers <- function(x, coef = 1.5) boxplot.stats(x, coef = coef)$out

This is what graphics::boxplot uses. It works slightly differently from ggplot2: boxplot.stats builds its limits from the hinges returned by fivenum, whereas ggplot2 uses quantiles, which I think is equivalent to:

my_outliers2 <- function(x, coef = 1.5) {
  x[x > quantile(x, 0.75) + IQR(x) * coef | x < quantile(x, 0.25) - IQR(x) * coef]
}
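
To see the difference between the two rules in practice, here is a small check on mtcars$mpg. It is only a sketch: it repeats the two definitions above so that it runs on its own, and the comments reflect my own hand calculation of the limits (hinges from fivenum for the first rule, quantile for the second).

```r
# Hinge-based rule, as used by graphics::boxplot
my_outliers <- function(x, coef = 1.5) boxplot.stats(x, coef = coef)$out

# Quantile-based rule, which should match what ggplot2 draws
my_outliers2 <- function(x, coef = 1.5) {
  x[x > quantile(x, 0.75) + IQR(x) * coef | x < quantile(x, 0.25) - IQR(x) * coef]
}

hinge_based <- my_outliers(mtcars$mpg)      # numeric(0): upper limit is 33.975
quantile_based <- my_outliers2(mtcars$mpg)  # 33.9: upper limit is 33.8625

print(hinge_based)
print(quantile_based)
```

So the 33.9 point from the question is flagged by the quantile-based rule but not by boxplot.stats, because the two estimates of the lower quartile differ slightly (15.35 versus 15.425).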

Upvotes: 1

Pikada

Reputation: 31

I was able to achieve my desired output. Using the formula for boxplot outliers I was able to make two neat functions that not only serve the desired purpose, but also work within the tidyverse semantics:

# smaller function to find the boxplot whisker limits:

outlierLimits = function(x, extreme = FALSE){
  # first and third quartiles
  qts = quantile(x, c(.25, .75), names = FALSE)

  # named iqr so it does not shadow stats::IQR
  iqr = qts[2] - qts[1]

  # the extreme limits are kept only when extreme = TRUE
  ret = c(
    lower = qts[1] - iqr*1.5,
    upper = qts[2] + iqr*1.5,
    lower.extreme = qts[1] - iqr*3,
    upper.extreme = qts[2] + iqr*3
  )[c(TRUE, TRUE, extreme, extreme)]

  return(ret)
}
# The function I was looking for:

outlierClassify = function(x, extreme = FALSE,
                           labels = c("regular", "outlier",
                                      "extreme")[c(TRUE, TRUE, extreme)]){
  lims = outlierLimits(x, extreme)

  # values exactly on a whisker limit count as regular, as in a boxplot
  ret = ifelse(x >= lims[1] & x <= lims[2],
               labels[1], labels[2])

  if(extreme){
    # among the flagged values, relabel those beyond 3 * IQR as extreme
    flagged = ret != labels[1]
    ret[flagged] = ifelse(
      x[flagged] >= lims[3] & x[flagged] <= lims[4],
      labels[2], labels[3]
    )
  }
  return(ret)
}

This way, the outlierClassify function returns a character vector of the same length as the input vector x, with one label per element.
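
As a sanity check on the formula, the limits for mtcars$mpg can also be worked out step by step. This is a standalone sketch that does not call the functions above; the names q, iqr, fences and flag are only for illustration.

```r
x <- mtcars$mpg

# quartiles and interquartile range (default quantile type)
q <- quantile(x, c(0.25, 0.75), names = FALSE)   # 15.425 and 22.8
iqr <- q[2] - q[1]                               # 7.375

# whisker limits: 1.5 * IQR beyond each quartile
fences <- c(lower = q[1] - 1.5 * iqr,            # 4.3625
            upper = q[2] + 1.5 * iqr)            # 33.8625

# label every value; anything strictly outside the limits is an outlier
flag <- ifelse(x < fences["lower"] | x > fences["upper"],
               "outlier", "regular")

print(fences)
print(rownames(mtcars)[flag == "outlier"])       # "Toyota Corolla"
```

Only the Toyota Corolla (33.9 mpg) falls outside these limits, which matches the single point drawn by ggplot in the question.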

Some usage examples:

# simply obtaining the resulting vector

outlierClassify(mtcars$mpg, FALSE)
# using it with mutate()
library(dplyr)

test = mtcars %>%
  select(mpg, cyl) %>%
  mutate(car = rownames(mtcars),
         .before = 1) %>% 

  # added an 'extreme' outlier as an example
  rbind(data.frame(
    car = "UNO Mille", mpg = 34, cyl = 6
  )) %>% 
  group_by(cyl) %>% 
  mutate(outliers = outlierClassify(mpg, TRUE),
         .after = mpg)
# using it with ggplot
library(ggplot2)

test %>% 
  ggplot(aes(x = as.factor(cyl), y = mpg))+
  geom_boxplot(outlier.shape = NA, fill = "#2645df", alpha = .6)+
  geom_jitter(aes(color = outliers), width = .1)+
  #making it pretty
  scale_color_manual(values = c("red", "darkorange", "black"))+
  theme_minimal()+
  theme(plot.background = element_rect(fill = "wheat3"))

Upvotes: 2

Michiel Duvekot

Reputation: 1881

If you want to see the outliers for each column of mtcars, you could standardize the columns (otherwise their very different scales squash most of the boxes flat), reshape the data with pivot_longer, and plot with outliers = TRUE.

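library(dplyr)
library(tidyr)
library(ggplot2)

mtcars %>%
  # scale() returns a one-column matrix, so convert back to a plain vector
  mutate(across(everything(), ~ as.numeric(scale(.x)))) %>%
  pivot_longer(cols = everything()) %>%
  ggplot(aes(x = name, y = value)) +
  geom_boxplot(
    outliers = TRUE,
    fill = "#2645df")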
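
If you would rather list the flagged values than plot them, something along these lines might work. It is a sketch under my own assumptions: find_outliers is a hypothetical helper, not part of any package, and it uses the quantile-based 1.5 * IQR rule, which I believe is what ggplot2 draws.

```r
# Hypothetical helper: values beyond coef * IQR from the quartiles
find_outliers <- function(x, coef = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE, na.rm = TRUE)
  fence <- coef * (q[2] - q[1])
  x[!is.na(x) & (x < q[1] - fence | x > q[2] + fence)]
}

# apply the helper to every column; the result is a named list
outliers_by_column <- lapply(mtcars, find_outliers)

# keep only the columns that have at least one flagged value
print(Filter(length, outliers_by_column))
```

Standardizing is not needed here: it is a linear transformation, so it changes the values that are reported but not which rows are flagged.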

Upvotes: 0
