Reputation: 1875
I have the following fake data representing the answering times (in seconds) of different users in an online questionnaire:
library(dplyr)
library(ggplot2)

n <- 1000
dat <- data.frame(user = 1:n,
                  # draw one question label per user
                  question = sample(paste("q", 1:10, sep = ""), size = n, replace = TRUE),
                  time = round(rnorm(n, mean = 10, sd = 4), 0)
)
dat %>%
  ggplot(aes(x = question, y = time)) +
  geom_boxplot(fill = 'orange') +
  ggtitle("Answering time per question")
I then plot the answering times as one boxplot per question. But how can I first calculate a binary column showing whether each case is an outlier within its question, where an outlier is any value outside median(time) +/- 3 * mad(time)?
Upvotes: 0
Views: 284
Reputation: 66490
library(dplyr)
library(ggplot2)

dat %>%
  group_by(question) %>%
  # TRUE when a time is more than 3 MADs from its question's median
  mutate(outlier = abs(time - median(time)) > 3 * mad(time)) %>%
  ungroup() %>%
  ggplot(aes(x = question, y = time)) +
  geom_boxplot(fill = 'orange') +
  geom_point(data = . %>% filter(outlier), color = "red") +
  ggtitle("Answering time per question")
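Because the data are grouped by question first, median(time) and mad(time) are computed separately for each question, so every row is compared against the median and MAD of its own question. Note that R's mad() multiplies the raw median absolute deviation by 1.4826 by default; pass constant = 1 if you want the unscaled value.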
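If you only want the column and not the plot, the same grouped mutate can be checked on a tiny data set. This is a minimal sketch: the toy values and the helper column names (med, mad_time) are made up for illustration and are not part of the question's data.

```r
library(dplyr)

# Hypothetical toy data: q1 contains one obviously extreme time (40)
toy <- data.frame(
  question = rep(c("q1", "q2"), each = 5),
  time     = c(9, 10, 10, 11, 40,
               4, 5, 6, 7, 8)
)

flagged <- toy %>%
  group_by(question) %>%
  mutate(
    med      = median(time),            # per-question median
    mad_time = mad(time),               # per-question MAD (default constant = 1.4826)
    outlier  = abs(time - med) > 3 * mad_time
  ) %>%
  ungroup()

print(flagged)
```

Only the 40 in q1 should be flagged here. The outlier column is logical; wrap it in as.integer() if you need 0/1 instead of TRUE/FALSE. Also be aware that a group whose MAD is 0 (more than half the values equal to the median) will flag every value that differs from the median.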
Upvotes: 1