R: identify outliers and mark them in a boxplot

Question

I have the following fake data representing the answering times (in seconds) of different users in an online questionnaire:

library(dplyr)    # provides %>%
library(ggplot2)

n <- 1000

dat <- data.frame(user = 1:n, 
                  question = sample(paste("q", 1:10, sep = ""), size = n, replace = TRUE),  # one draw per row
                  time = round(rnorm(n, mean = 10, sd=4), 0)
                  )
dat %>%
  ggplot(aes(x = question, y = time)) + 
  geom_boxplot(fill = 'orange') +
  ggtitle("Answering time per question")

I then plot the answering times as one boxplot per question. But how can I first calculate a column with a binary variable showing whether a case is an outlier or not, defined as lying outside median(time) +/- 3 * mad(time), within each question?

Jon Spring · Accepted Answer

library(dplyr)
library(ggplot2)

dat %>%
  group_by(question) %>%
  # flag rows more than 3 MADs away from the median of their own question
  mutate(outlier = abs(time - median(time)) > 3 * mad(time)) %>%
  ungroup() %>%
  ggplot(aes(x = question, y = time)) + 
  geom_boxplot(fill = 'orange') +
  # overlay only the flagged rows as red points
  geom_point(data = . %>% filter(outlier), color = "red") +
  ggtitle("Answering time per question")

Because the data are grouped by question first, each row's time is compared against the median and MAD of its own question rather than of the whole data set.
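To check the per-group logic by hand, here is a minimal base-R sketch on a made-up data frame. The `toy` data and the helper vectors `grp_median` and `grp_mad` are hypothetical names used only for illustration. It uses `ave()` to repeat each question's median and MAD on every row of that question, which should match what the grouped `mutate()` does. Keep in mind that R's `mad()` multiplies the raw median absolute deviation by 1.4826 by default, so the cut-off is 3 times that scaled value.

```r
# Hypothetical toy data: two questions, with one very slow answer in q1.
toy <- data.frame(
  question = c("q1", "q1", "q1", "q1", "q1", "q2", "q2", "q2", "q2", "q2"),
  time     = c(9, 10, 11, 10, 40, 5, 6, 7, 6, 5)
)

# Per-question median and MAD, repeated on every row of that question.
grp_median <- ave(toy$time, toy$question, FUN = median)
grp_mad    <- ave(toy$time, toy$question, FUN = mad)

# TRUE where a time lies more than 3 MADs from its question's median.
toy$outlier <- abs(toy$time - grp_median) > 3 * grp_mad

print(toy)
```

In this toy example only the 40-second answer in q1 should be flagged. Swapping `toy` for `dat` gives the same `outlier` column without dplyr, if you prefer to avoid the dependency.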
