D. Studer

Reputation: 1875

Overlaying boxplot with a lineplot

I have some fake data representing the answering times of different users answering an online survey. The dataset has three variables: the id of the respondent (user), the name of the question (question) and the answering time for each question (time).


library(dplyr)
library(ggplot2)

n <- 1000
dat <- data.frame(user = 1:n, 
                  question = sample(paste("q", 1:4, sep = ""), size = n, replace = TRUE),
                  time = round(rnorm(n, mean = 10, sd = 4), 0)
                  )

# highlightUsers is not used yet; the goal is to overlay those users' times
pltSingleRespondent <- function(df, highlightUsers){
  df %>%   # use the function argument, not the global dat
    ggplot(aes(x = question, y = time)) + 
    geom_boxplot(fill = 'orange') + coord_flip() +
    ggtitle("Answering time per question")
}


pltSingleRespondent(dat, c(1, 31) )

I have written a function that draws a boxplot of the answering times for each question. Now I'd like to overlay that plot with the answering times of specific respondents (highlightUsers). The following image shows an example:


Can someone please explain to me how to do this?

Upvotes: 2

Views: 791

Answers (2)

neilfws

Reputation: 33782

A slightly different approach: add a column to the data that flags the highlighted users, and map that column to color and group in geom_line. Use scale_color_discrete(na.translate = FALSE) so that the NA values (all other users) are left out of the color scale and legend.

library(dplyr)
library(ggplot2)

pltSingleRespondent <- function(df, highlightUsers) {
  df %>% 
    mutate(User = factor(ifelse(user %in% highlightUsers, user, NA))) %>% 
    ggplot(aes(question, time)) +
    geom_boxplot(fill = "orange") +
    geom_line(aes(color = User, group = User)) + 
    ggtitle("Answering time per question") +
    scale_color_discrete(na.translate = FALSE) + 
    coord_flip() + 
    theme_bw()
}

Using the example data from @r2evans's answer:

pltSingleRespondent(dat, c(1, 34))


Upvotes: 1

r2evans

Reputation: 160577

I think the most direct way to do this is to subset your data within a call to geom_line.

I'll start with a different set of random data, since the sample data in the question does not include all questions for a user.

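library(dplyr)    # for %>%
library(ggplot2)

# every user answers every question, so each user gets a complete line
set.seed(2021)
dat <- expand.grid(user = factor(1:50), question = paste0("q", 1:4))
dat$time <- round(rnorm(200, mean = 10, sd = 4), 0)

dat %>%
  ggplot(aes(x = question, y = time)) + 
  geom_boxplot(fill = 'orange') + coord_flip() +
  ggtitle("Answering time per question") +
  # draw lines only for the selected users; on ggplot2 < 3.4.0 use size = 2
  geom_line(aes(color = user, group = user), linewidth = 2,
            data = ~ subset(., user %in% c(1L, 34L)))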

ggplot with lines for two users

You can wrap this in a function however you like. If you're using dplyr, dplyr::filter works in place of subset with no other change.
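As a minimal sketch of one way to wrap this in a function, here is a version that keeps the function name and arguments from the question and uses dplyr::filter in place of subset. The name highlight_df is introduced here for illustration only, and the data are the example data from this answer.

```r
library(dplyr)
library(ggplot2)

# Example data from this answer: every user answers every question.
set.seed(2021)
dat <- expand.grid(user = factor(1:50), question = paste0("q", 1:4))
dat$time <- round(rnorm(200, mean = 10, sd = 4), 0)

pltSingleRespondent <- function(df, highlightUsers) {
  # Rows of the highlighted users only; the boxplots still use all of df.
  # highlight_df is an illustrative name, not part of the original answer.
  highlight_df <- dplyr::filter(df, user %in% highlightUsers)

  ggplot(df, aes(x = question, y = time)) +
    geom_boxplot(fill = "orange") +
    geom_line(aes(color = user, group = user), data = highlight_df) +
    coord_flip() +
    ggtitle("Answering time per question")
}

p <- pltSingleRespondent(dat, c(1, 34))
p
```

Passing the filtered data frame to geom_line through its data argument leaves the boxplot layer untouched, so the distribution is still computed from all respondents.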

Also, I made user a factor (factor(1:50)), since otherwise ggplot2 treats a numeric user as continuous for color = user. You can skip this, but then you may need more wrangling to get a discrete color scale.
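If user is stored as a number, as in the question's data (user = 1:n), one way to get a discrete color scale is to convert it with factor() before plotting. A small sketch follows, using a made-up three-user data frame; small and its values are hypothetical and not from the original data.

```r
# Hypothetical toy data: numeric user ids, two questions each.
small <- data.frame(
  user = rep(1:3, times = 2),
  question = rep(c("q1", "q2"), each = 3),
  time = c(8, 12, 10, 9, 11, 14)
)

# Numeric ids would be mapped to a continuous color scale.
is.numeric(small$user)

# Convert to a factor so that color = user is treated as discrete.
small$user <- factor(small$user)
levels(small$user)
```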

Upvotes: 4
