Reputation: 2890
So I'm trying to colour my boxplots of student data, depending on whether each student passed a certain exam or not.
To do this I read in two separate .csv files and structured the data so it's ordered by a certain variable. Here's the head of one of the datasets:
head(Y)
fName mid lab exam overall
1 OOJOE 50 94 77.5 77
2 JWTWB 45 50 60.5 54
3 XQXQA 65 78 69.0 71
4 PVTMX 35 84 30.5 47
5 ZZBDP 70 100 74.0 81
6 JVYMA 65 96 73.5 79
The other data set (X) contains info about student attendance etc. I made a boxplot for each student, showing the median attendance (using dataset X) but my overall goal is to colour the boxplots depending on whether each student got an overall grade of 40 or above (which is from data set Y).
I'm using ggplot... Here's what I've tried so far:
ggplot(data=X,aes(x=fName, y=delay, group=fName)) +
geom_boxplot(color = Y$overall <40) +
scale_colour_manual(name = 'overall < 40', values = setNames(c('red','green'),c(T, F))) +
coord_flip()
But this just tells me that it's an invalid colour name... I also tried:
ggplot(data=X,aes(x=fName, y=delay, group=fName)) +
geom_boxplot(color = ifelse(Y$overall >= 40,'red','green')) +
coord_flip()
This does separate the boxplots into two different colours... but it's not doing it correctly (i.e., it's not colouring the values >= 40 red and all the others green... it just seems to be randomly assigning some students red and some green). I suspect it's not working because of the ifelse command not working well with ggplot, but I'm not sure.
Any suggestions as to how I'd fix this?
EDIT: Here's the head of X, just to show an example of the delay column:
head(X)
fName Information fTime min.Time. delay
1 ARONR Course outline 2010-09-22T09:16:00Z 2010-09-20T20:21:00Z 1.5381944
2 ARONR Lab Dec 13 2010-12-11T17:21:00Z 2010-12-09T12:20:00Z 2.2090278
3 ARONR Lab Nov 1 2010-11-03T11:10:00Z 2010-10-28T17:21:00Z 5.7423611
4 ARONR Lab Nov 22 2010-11-22T14:16:00Z 2010-11-22T11:51:00Z 0.1006944
5 ARONR Lab Nov 29 2010-11-29T15:04:00Z 2010-11-25T18:00:00Z 3.8777778
6 ARONR Lab Nov 8 2010-11-10T11:07:00Z 2010-11-05T19:12:00Z 4.6631944
Here's some additional data for a single student, to help with merging on a common key.
Y data set:
fName mid lab exam overall
ZZBDP 70 100 74.0 81
X data set:
fName Information fTime min.Time. delay
ZZBDP Lecture Dec 1 2010-12-01T13:02:00Z 2010-12-01T12:31:00Z 2.152778e-02
ZZBDP Lecture Dec 8 2010-12-08T08:49:00Z 2010-12-07T16:43:00Z 6.708333e-01
ZZBDP Lecture Nov 10 2010-11-10T11:14:00Z 2010-11-09T13:35:00Z 9.020833e-01
ZZBDP Lecture Nov 17 2010-11-17T18:25:00Z 2010-11-17T10:31:00Z 3.291667e-01
ZZBDP Lecture Nov 24 2010-11-24T09:23:00Z 2010-11-23T11:35:00Z 9.083333e-01
Upvotes: 1
Views: 661
Reputation: 1475
By converting the updated X and Y data.frames to data.tables and then merging them on fName, I get the following data.table (dt):
library(data.table)
library(ggplot2)
X <- structure(list(fName = c("ZZBDP", "ZZBDP", "ZZBDP", "ZZBDP",
"ZZBDP"), Information = c("Lecture Dec 1", "Lecture Dec 8", "Lecture Nov 10",
"Lecture Nov 17", "Lecture Nov 24"), fTime = c("2010-12-01T13:02:00Z",
"2010-12-08T08:49:00Z", "2010-11-10T11:14:00Z", "2010-11-17T18:25:00Z",
"2010-11-24T09:23:00Z"), min.Time. = c("2010-12-01T12:31:00Z",
"2010-12-07T16:43:00Z", "2010-11-09T13:35:00Z", "2010-11-17T10:31:00Z",
"2010-11-23T11:35:00Z"), delay = c(0.0215, 0.671, 0.902, 0.329,
0.908)), .Names = c("fName", "Information", "fTime", "min.Time.",
"delay"), row.names = c(NA, -5L), class = "data.frame")
Y <- structure(list(fName = c("OOJOE", "JWTWB", "XQXQA", "PVTMX",
"ZZBDP", "JVYMA"), mid = c(50L, 45L, 65L, 35L, 70L, 65L), lab = c(94L,
50L, 78L, 84L, 100L, 96L), exam = c(77.5, 60.5, 69, 30.5, 74,
73.5), overall = c(77L, 54L, 71L, 47L, 81L, 79L)), .Names = c("fName",
"mid", "lab", "exam", "overall"), row.names = c(NA, -6L), class = "data.frame")
# Convert to data.table
setDT(X)
setDT(Y)
# Merge X and Y on fName and store in dt
dt <- Y[X, on="fName"]
> dt
fName mid lab exam overall Information fTime min.Time. delay
1: ZZBDP 70 100 74 81 Lecture Dec 1 2010-12-01T13:02:00Z 2010-12-01T12:31:00Z 0.0215
2: ZZBDP 70 100 74 81 Lecture Dec 8 2010-12-08T08:49:00Z 2010-12-07T16:43:00Z 0.6710
3: ZZBDP 70 100 74 81 Lecture Nov 10 2010-11-10T11:14:00Z 2010-11-09T13:35:00Z 0.9020
4: ZZBDP 70 100 74 81 Lecture Nov 17 2010-11-17T18:25:00Z 2010-11-17T10:31:00Z 0.3290
5: ZZBDP 70 100 74 81 Lecture Nov 24 2010-11-24T09:23:00Z 2010-11-23T11:35:00Z 0.9080
The above data.table contains the independent variable (fName), the dependent variable (delay), and the variable to use for colouring (overall).
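Before plotting, it may be worth checking that every student in X actually has a matching row in Y, because an unmatched fName ends up with NA for overall after the join and would then fall outside both colour categories. The following is only a base-R sketch on made-up toy rows (the student "AAAAA" is hypothetical and not part of the original data); merge() with all.x = TRUE is used here as a stand-in for the data.table join above.

```r
# Toy data: "AAAAA" is a hypothetical student present in X but missing from Y
X <- data.frame(fName = c("ZZBDP", "ZZBDP", "AAAAA"),
                delay = c(0.0215, 0.671, 1.2))
Y <- data.frame(fName = c("ZZBDP", "PVTMX"),
                overall = c(81L, 47L))

# Students that have attendance rows but no grade row
unmatched <- setdiff(unique(X$fName), Y$fName)
print(unmatched)

# Left join in base R: keep every row of X, attach grades where available
merged <- merge(X, Y, by = "fName", all.x = TRUE)

# Rows whose overall is NA are the ones that found no match in Y
print(merged[is.na(merged$overall), ])
```

If unmatched is empty, every box in the plot can be coloured from overall.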
To make a boxplot of delay vs fName with overall scores greater than or equal to 40 being colored red (and those below 40 colored green), use:
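ggplot(dt, aes(x = fName, y = delay, group = fName, color = overall >= 40)) +
  geom_boxplot() +
  # Name the values explicitly so TRUE (overall >= 40) is red and FALSE is green;
  # an unnamed c("red", "green") would be matched to FALSE, TRUE in that order
  scale_color_manual(values = c("TRUE" = "red", "FALSE" = "green")) +
  coord_flip()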
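As for why the ifelse attempt in the question looked random: a likely explanation (an assumption on my part, not checked against the full data) is that a colour vector passed outside aes() is applied by position, while the boxes are laid out in the sorted order of fName rather than in the row order of Y. The sketch below uses only the six rows shown in head(Y) and compares the two orderings; the names axis_order, by_row and by_name are just illustrative.

```r
# The six rows shown in head(Y) in the question
Y <- data.frame(
  fName   = c("OOJOE", "JWTWB", "XQXQA", "PVTMX", "ZZBDP", "JVYMA"),
  overall = c(77L, 54L, 71L, 47L, 81L, 79L)
)

# Default placement of a discrete variable on an axis: sorted names
axis_order <- sort(unique(Y$fName))

# Colours in Y's row order, as built by ifelse() in the question
by_row <- ifelse(Y$overall >= 40, "red", "green")

# Colours looked up by student name, so each colour follows its student
by_name <- unname(setNames(by_row, Y$fName)[axis_order])

# Positions where the student in Y's row order differs from the axis order
misaligned <- Y$fName != axis_order
print(data.frame(axis_order, row_order = Y$fName, misaligned))
```

Mapping overall inside aes() on the merged data avoids this positional matching altogether, since each row carries its own grade.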
Upvotes: 1