ggplot structuring data boxplot of treatment effects in multiple time periods

Question

I have data currently structured like so:

set.seed(100)
require(ggplot2)
require(reshape2)


d<-data.frame("ID" = 1:30,
           "Treatment1" = sample(0:1,30,replace = T, prob = c(0.5,0.5)),
           "Score1" = rnorm(30)^2,
           "Treatment2" = sample(0:1,30,replace = T,prob = c(0.3,0.7)),
           "Score2" = rnorm(30)^2,
           "Treatment3" = sample(0:1,30,replace = T,prob = c(0.2,0.8)),
           "Score3" = rnorm(30)^2)

There are unique IDs, 3 different treatments (coded 1 if the ID received the given treatment and 0 if not), and the scores each ID has after each treatment period. I'm trying to create a boxplot that illustrates the score distribution associated with each treatment period, but I'm either not melting the data properly, not coding the plot properly, or both.

d.melt<-melt(d,id.vars = c("ID","Treatment1","Treatment2","Treatment3"),measure.vars = c("Score1","Score2","Score3"))

I can produce a boxplot that shows the scores separated by whether the IDs received one of the three treatments with this code:

ggplot(d.melt)+
  geom_boxplot(aes(x = variable,y = value,fill = factor(Treatment1)))

But this only splits the scores by whether each ID got treatment 1; it doesn't show the treated/untreated difference in scores for each of the 3 treatments. Any help getting my head around this problem would be great. Thank you in advance.

eipi10 · Accepted Answer

The complication is that the data has pairs of columns (Treatment1, Score1, etc.) representing each treatment/score and we need to keep track of both whether a given subject received a given Treatment and their Score for each treatment. I've used one of the map functions from the purrr package (which is part of the tidyverse suite of packages) for this.

The code steps through each of the three pairs of treatments/scores, adds a column called Treatment indicating the treatment number and returns the stacked (long format) data frame.

library(tidyverse)

dr = map2_df(seq(2,ncol(d),2), seq(3,ncol(d),2), 
             function(t,s) {
               data.frame(ID = d[,"ID"], 
                          Treatment = gsub(".*([0-9]$)", "\\1", names(d)[t]), 
                          Treat_Flag = d[,t], 
                          Score = d[,s])
             })

Now we plot the data using Treatment on the x-axis to mark the treatment number and color by Treat_Flag to provide separate box plots based on whether a given subject received a given treatment.

ggplot(dr, aes(Treatment, Score, colour=factor(Treat_Flag))) +
  geom_boxplot() +
  theme_classic() +
  labs(colour="Treatment Indicator")

Here's another way to reshape the data. The code below uses functions from tidyr rather than from reshape2 (tidyr is the successor to reshape2). Here, gather(d, key, value, -ID) is essentially equivalent to melt(d, id.vars="ID"). You can stop the chain of functions at any step to look at the intermediate outputs. This approach is probably more in keeping with the tidyverse paradigm for data reshaping, but I find it a bit less intuitive than the map approach above.

dr = gather(d, key, value, -ID) %>%
  separate(key, into=c("key", "value2"), sep="(?=[0-9])") %>%
  spread(key, value) %>%
  rename(Treatment=value2, Treat_Flag=Treatment)
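
For what it's worth, newer versions of tidyr (1.0.0 and later) supersede gather() and spread() with pivot_longer() and pivot_wider(). The sketch below is not part of the original answer; it assumes tidyr >= 1.0.0 and rebuilds the same long data frame in one reshaping step. The names dr2 and Period are made up for illustration, and the data frame d is recreated from the question so the block runs on its own.

```r
library(tidyr)
library(dplyr)

# Recreate the example data from the question
set.seed(100)
d <- data.frame(ID = 1:30,
                Treatment1 = sample(0:1, 30, replace = TRUE, prob = c(0.5, 0.5)),
                Score1 = rnorm(30)^2,
                Treatment2 = sample(0:1, 30, replace = TRUE, prob = c(0.3, 0.7)),
                Score2 = rnorm(30)^2,
                Treatment3 = sample(0:1, 30, replace = TRUE, prob = c(0.2, 0.8)),
                Score3 = rnorm(30)^2)

# Split each column name into a stem ("Treatment" or "Score") and a trailing number.
# ".value" keeps one output column per stem; the number goes into a temporary
# column called Period (a hypothetical name, chosen to avoid clashing with "Treatment").
dr2 <- pivot_longer(d, -ID,
                    names_to = c(".value", "Period"),
                    names_pattern = "([A-Za-z]+)([0-9]+)") %>%
  rename(Treat_Flag = Treatment) %>%   # the 0/1 indicator
  rename(Treatment = Period)           # the treatment number (1, 2 or 3)
```

dr2 should have the same four columns as dr (ID, Treatment, Treat_Flag, Score), so it can be passed to the same ggplot call shown earlier.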
