Reputation: 406
I have data currently structured like so:
set.seed(100)
library(ggplot2)
library(reshape2)
d <- data.frame("ID" = 1:30,
                "Treatment1" = sample(0:1, 30, replace = TRUE, prob = c(0.5, 0.5)),
                "Score1" = rnorm(30)^2,
                "Treatment2" = sample(0:1, 30, replace = TRUE, prob = c(0.3, 0.7)),
                "Score2" = rnorm(30)^2,
                "Treatment3" = sample(0:1, 30, replace = TRUE, prob = c(0.2, 0.8)),
                "Score3" = rnorm(30)^2)
Here there are unique IDs, three different treatments (coded 1 if the ID received the given treatment and 0 if not), and the scores the IDs have after each treatment period. I'm trying to create a boxplot that illustrates the score distribution associated with each treatment period across the unique IDs in the data set, but I'm either not melting the data properly, not coding the plot properly, or both.
d.melt <- melt(d,
               id.vars = c("ID", "Treatment1", "Treatment2", "Treatment3"),
               measure.vars = c("Score1", "Score2", "Score3"))
I can produce a boxplot that shows the scores separated by whether the IDs received one of the three treatments with this code:
ggplot(d.melt) +
  geom_boxplot(aes(x = variable, y = value, fill = factor(Treatment1)))
But this only splits all of the scores by whether the IDs got treatment 1; it does not split each score by its own treatment across all three treatments. Any help getting my head around this problem would be great. Thank you in advance.
Upvotes: 1
Views: 779
Reputation: 93851
The complication is that the data has pairs of columns (Treatment1, Score1, etc.) representing each treatment/score, and we need to keep track of both whether a given subject received a given Treatment and their Score for each treatment. I've used one of the map functions from the purrr package (which is part of the tidyverse suite of packages) for this.
The code steps through each of the three pairs of treatment/score columns, adds a column called Treatment indicating the treatment number, and returns the stacked (long format) data frame.
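library(tidyverse)

# Treatment columns sit at positions 2, 4, 6 and Score columns at 3, 5, 7
dr = map2_df(seq(2, ncol(d), 2), seq(3, ncol(d), 2),
             function(t, s) {
               data.frame(ID = d[, "ID"],
                          # Keep the trailing digit of the name, e.g. "Treatment1" -> "1"
                          Treatment = gsub(".*([0-9]$)", "\\1", names(d)[t]),
                          Treat_Flag = d[, t],
                          Score = d[, s])
             })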
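For comparison, the same stacking step can be sketched in base R without purrr. This is only an illustration of the idea described above, not a replacement for the map2_df call: the name dr_base is made up for this example, and the sketch assumes the columns are named Treatment1..Treatment3 and Score1..Score3 as in the question.

```r
# Rebuild the example data from the question so the sketch is self-contained
set.seed(100)
d <- data.frame(ID = 1:30,
                Treatment1 = sample(0:1, 30, replace = TRUE, prob = c(0.5, 0.5)),
                Score1 = rnorm(30)^2,
                Treatment2 = sample(0:1, 30, replace = TRUE, prob = c(0.3, 0.7)),
                Score2 = rnorm(30)^2,
                Treatment3 = sample(0:1, 30, replace = TRUE, prob = c(0.2, 0.8)),
                Score3 = rnorm(30)^2)

# For each treatment number, pull the matching flag and score columns by name,
# then stack the three pieces into one long data frame.
# dr_base is a hypothetical name used only in this sketch.
dr_base <- do.call(rbind, lapply(1:3, function(i) {
  data.frame(ID = d$ID,
             Treatment = as.character(i),
             Treat_Flag = d[[paste0("Treatment", i)]],
             Score = d[[paste0("Score", i)]])
}))

# One row per ID per treatment
str(dr_base)
```

Looking columns up by name rather than by position means the result does not depend on the column order in d, at the cost of hard-coding the name pattern.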
Now we plot the data using Treatment on the x-axis to mark the treatment number and color by Treat_Flag to provide separate box plots based on whether a given subject received a given treatment.
ggplot(dr, aes(Treatment, Score, colour = factor(Treat_Flag))) +
  geom_boxplot() +
  theme_classic() +
  labs(colour = "Treatment Indicator")
Here's another way to reshape the data. The code below uses functions from tidyr rather than from reshape2 (tidyr is the successor to reshape2). In the code below, gather(d, key, value, -ID) is essentially equivalent to melt(d, id.var="ID"). You can stop the chain of functions at any step to look at the intermediate outputs. This approach is probably more in keeping with the tidyverse paradigm for data reshaping, but I find it a bit less intuitive than the map approach above.
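dr = gather(d, key, value, -ID) %>%
  # Split e.g. "Treatment1" into "Treatment" and "1" just before the digit
  separate(key, into = c("key", "value2"), sep = "(?=[0-9])") %>%
  # One column each for Score and Treatment
  spread(key, value) %>%
  rename(Treatment = value2, Treat_Flag = Treatment)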
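Since tidyr 1.0.0, gather() and spread() have been superseded by pivot_longer() and pivot_wider(). A minimal sketch of the same reshape with pivot_longer() follows. It assumes tidyr >= 1.0.0 and the column naming from the question, and the names Period and dr2 are made up for this example.

```r
library(tidyr)
library(dplyr)

# Rebuild the example data from the question so the sketch is self-contained
set.seed(100)
d <- data.frame(ID = 1:30,
                Treatment1 = sample(0:1, 30, replace = TRUE, prob = c(0.5, 0.5)),
                Score1 = rnorm(30)^2,
                Treatment2 = sample(0:1, 30, replace = TRUE, prob = c(0.3, 0.7)),
                Score2 = rnorm(30)^2,
                Treatment3 = sample(0:1, 30, replace = TRUE, prob = c(0.2, 0.8)),
                Score3 = rnorm(30)^2)

# ".value" tells pivot_longer() that the letters part of each column name
# ("Treatment" or "Score") becomes an output column, while the digit part
# goes into a new column (here called Period, a hypothetical name).
dr2 <- d %>%
  pivot_longer(-ID,
               names_to = c(".value", "Period"),
               names_pattern = "([A-Za-z]+)([0-9]+)") %>%
  rename(Treat_Flag = Treatment) %>%  # the 0/1 indicator
  rename(Treatment = Period)          # the treatment number, as in dr above

head(dr2)
```

The result should have the same columns as dr (ID, Treatment, Treat_Flag, Score), so the ggplot call above can be reused with dr2 in place of dr.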
Upvotes: 1