D. Studer

Reputation: 1875

R: identify outliers and mark them in a boxplot

I have the following fake data representing the answering times (in seconds) of different users in an online questionnaire:

library(dplyr)
library(ggplot2)

n <- 1000

dat <- data.frame(user = 1:n, 
                  # draw one question label per row (size = n, not 10)
                  question = sample(paste0("q", 1:10), size = n, replace = TRUE),
                  time = round(rnorm(n, mean = 10, sd = 4), 0)
                  )
dat %>%
  ggplot(aes(x = question, y = time)) + 
  geom_boxplot(fill = 'orange') +
  ggtitle("Answering time per question")

I then plot the answering times as one boxplot per question. But how can I first calculate a binary column that shows whether a case is an outlier within its question, where an outlier is any value outside median(time) +/- 3 * mad(time)?

Upvotes: 0

Views: 284

Answers (1)

Jon Spring

Reputation: 66490

library(dplyr)
library(ggplot2)

dat %>%
  # compute the median and MAD separately for each question
  group_by(question) %>%
  mutate(outlier = abs(time - median(time)) > 3 * mad(time)) %>%
  ungroup() %>%

  ggplot(aes(x = question, y = time)) + 
  geom_boxplot(fill = 'orange') +
  # overlay only the rows flagged as outliers, in red
  geom_point(data = . %>% filter(outlier), color = "red") +
  ggtitle("Answering time per question")

Because the data are grouped by question first, median(time) and mad(time) are computed within each question, so each row is compared against the median and MAD of its own question.
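To make the effect of the grouping concrete, here is a minimal sketch on a small hand-made data frame. The names toy, med, mad_val and flagged are hypothetical and only used for illustration. It also relies on the fact that mad() in R scales the raw median absolute deviation by a constant of 1.4826 by default.

```r
library(dplyr)

# Hypothetical toy data: two questions, with one extreme time in q1
toy <- data.frame(
  question = rep(c("q1", "q2"), each = 5),
  time     = c(9, 10, 11, 10, 40, 20, 21, 19, 20, 22)
)

flagged <- toy %>%
  group_by(question) %>%
  mutate(
    med     = median(time),                    # per-question median
    mad_val = mad(time),                       # per-question MAD (scaled by 1.4826)
    outlier = abs(time - med) > 3 * mad_val    # TRUE/FALSE flag per row
  ) %>%
  ungroup()

# Only the 40-second answer in q1 is flagged
flagged %>% filter(outlier)

# Without grouping, the same rule uses the overall median and MAD,
# and here the 40-second answer is no longer flagged
ungrouped_flag <- abs(toy$time - median(toy$time)) > 3 * mad(toy$time)
any(ungrouped_flag)
```

If a 0/1 column is preferred over TRUE/FALSE, wrapping the comparison in as.integer() should do it.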


Upvotes: 1
