Electrino

Reputation: 2890

Bar plot colour depending on conditional value in R

So I'm trying to colour my plot of students' grades depending on whether they passed a certain exam or not.

To do this I read in two separate .csv files and structured the data so it's ordered by a certain variable. Here's the head of one of the datasets:

 head(Y)
  fName mid lab exam overall
1 OOJOE  50  94 77.5      77
2 JWTWB  45  50 60.5      54
3 XQXQA  65  78 69.0      71
4 PVTMX  35  84 30.5      47
5 ZZBDP  70 100 74.0      81
6 JVYMA  65  96 73.5      79

The other data set (X) contains info about student attendance etc. I made a boxplot for each student, showing the median attendance (using dataset X) but my overall goal is to colour the boxplots depending on whether each student got an overall grade of 40 or above (which is from data set Y).

I'm using ggplot... Here's what I've tried so far:

ggplot(data=X,aes(x=fName, y=delay, group=fName)) + 
  geom_boxplot(color = Y$overall <40) + 
  scale_colour_manual(name = 'overall < 40', values = setNames(c('red','green'),c(T, F))) + 
  coord_flip()

But this just tells me that it's an invalid colour name... I also tried:

ggplot(data=X,aes(x=fName, y=delay, group=fName)) + 
  geom_boxplot(color =  ifelse(Y$overall >= 40,'red','green')) +
  coord_flip()

This does separate the boxplots into two different colours... but it's not doing it correctly (i.e., it's not colouring the values >= 40 red and all the others green... it just seems to be randomly assigning some students red and some green). I suspect it's not working because of the ifelse command not working well with ggplot, but I'm not sure.

Any suggestions as to how I'd fix this?

EDIT: Here's the head of X, just to show an example of the delay column:

head(X)
  fName    Information                fTime            min.Time.     delay
1 ARONR Course outline 2010-09-22T09:16:00Z 2010-09-20T20:21:00Z 1.5381944
2 ARONR    Lab  Dec 13 2010-12-11T17:21:00Z 2010-12-09T12:20:00Z 2.2090278
3 ARONR      Lab Nov 1 2010-11-03T11:10:00Z 2010-10-28T17:21:00Z 5.7423611
4 ARONR     Lab Nov 22 2010-11-22T14:16:00Z 2010-11-22T11:51:00Z 0.1006944
5 ARONR     Lab Nov 29 2010-11-29T15:04:00Z 2010-11-25T18:00:00Z 3.8777778
6 ARONR      Lab Nov 8 2010-11-10T11:07:00Z 2010-11-05T19:12:00Z 4.6631944

Here's some additional data for a single student, to help with merging on a common key. Y data set:

fName mid lab exam overall
ZZBDP  70 100 74.0      81

X-data set:

fName    Information                fTime            min.Time.     delay
ZZBDP  Lecture Dec 1 2010-12-01T13:02:00Z 2010-12-01T12:31:00Z 2.152778e-02
ZZBDP  Lecture Dec 8 2010-12-08T08:49:00Z 2010-12-07T16:43:00Z 6.708333e-01
ZZBDP Lecture Nov 10 2010-11-10T11:14:00Z 2010-11-09T13:35:00Z 9.020833e-01
ZZBDP Lecture Nov 17 2010-11-17T18:25:00Z 2010-11-17T10:31:00Z 3.291667e-01
ZZBDP Lecture Nov 24 2010-11-24T09:23:00Z 2010-11-23T11:35:00Z 9.083333e-01

Upvotes: 1

Views: 661

Answers (1)

By converting the updated X and Y data.frames to data.tables and then merging them on fName I get the following data.table (dt):

library(data.table)
library(ggplot2)
X <- structure(list(fName = c("ZZBDP", "ZZBDP", "ZZBDP", "ZZBDP", 
      "ZZBDP"), Information = c("Lecture Dec 1", "Lecture Dec 8", "Lecture Nov 10", 
      "Lecture Nov 17", "Lecture Nov 24"), fTime = c("2010-12-01T13:02:00Z", 
      "2010-12-08T08:49:00Z", "2010-11-10T11:14:00Z", "2010-11-17T18:25:00Z", 
      "2010-11-24T09:23:00Z"), min.Time. = c("2010-12-01T12:31:00Z", 
      "2010-12-07T16:43:00Z", "2010-11-09T13:35:00Z", "2010-11-17T10:31:00Z", 
      "2010-11-23T11:35:00Z"), delay = c(0.0215, 0.671, 0.902, 0.329, 
      0.908)), .Names = c("fName", "Information", "fTime", "min.Time.", 
      "delay"), row.names = c(NA, -5L), class = "data.frame")  

Y <- structure(list(fName = c("OOJOE", "JWTWB", "XQXQA", "PVTMX", 
      "ZZBDP", "JVYMA"), mid = c(50L, 45L, 65L, 35L, 70L, 65L), lab = c(94L, 
      50L, 78L, 84L, 100L, 96L), exam = c(77.5, 60.5, 69, 30.5, 74, 
      73.5), overall = c(77L, 54L, 71L, 47L, 81L, 79L)), .Names = c("fName", 
      "mid", "lab", "exam", "overall"), row.names = c(NA, -6L), class = "data.frame")  

# Convert to data.table
setDT(X)
setDT(Y)

# Join Y's grades onto every row of X by fName (keeps all rows of X) and store in dt
dt <- Y[X, on="fName"]

> dt
   fName mid lab exam overall    Information                fTime            min.Time.  delay
1: ZZBDP  70 100   74      81  Lecture Dec 1 2010-12-01T13:02:00Z 2010-12-01T12:31:00Z 0.0215
2: ZZBDP  70 100   74      81  Lecture Dec 8 2010-12-08T08:49:00Z 2010-12-07T16:43:00Z 0.6710
3: ZZBDP  70 100   74      81 Lecture Nov 10 2010-11-10T11:14:00Z 2010-11-09T13:35:00Z 0.9020
4: ZZBDP  70 100   74      81 Lecture Nov 17 2010-11-17T18:25:00Z 2010-11-17T10:31:00Z 0.3290
5: ZZBDP  70 100   74      81 Lecture Nov 24 2010-11-24T09:23:00Z 2010-11-23T11:35:00Z 0.9080

The above data.table contains the independent variable (fName), the dependent variable (delay), and the variable to use for colouring (overall), all in one table with one row per delay observation.
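If you'd rather not use data.table, the same join can be sketched in base R with merge(). The toy data below (X_toy, Y_toy, and the student names AAAAA and BBBBB) is made up for illustration and is not from the question; the point is only that every delay row ends up carrying its student's overall grade.

```r
# Hypothetical toy data: two students, one passing and one failing
X_toy <- data.frame(
  fName = c("AAAAA", "AAAAA", "BBBBB", "BBBBB", "BBBBB"),
  delay = c(0.5, 1.2, 2.0, 3.1, 0.4)
)
Y_toy <- data.frame(
  fName   = c("AAAAA", "BBBBB"),
  overall = c(81, 35)
)

# Left join: keep every row of X_toy and attach the matching grade from Y_toy
merged <- merge(X_toy, Y_toy, by = "fName", all.x = TRUE)

# Each delay observation now carries its own student's pass/fail flag
merged$passed <- merged$overall >= 40
print(merged)
```

Using all.x = TRUE keeps students who have delay records but no grade; their overall would be NA, so you may want to check for that before plotting.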

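To make a boxplot of delay vs fName with overall scores greater than or equal to 40 coloured red (and those below 40 coloured green), map the condition to colour inside aes() and name the manual colour values so each colour is tied to the right logical value:

ggplot(dt, aes(x = fName, y = delay, group = fName, color = overall >= 40)) + 
  geom_boxplot() +
  # Named values tie each colour to TRUE/FALSE explicitly; unnamed values are
  # assigned in level order (FALSE first when both are present), which would swap them
  scale_color_manual(values = c("TRUE" = "red", "FALSE" = "green")) +
  coord_flip()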
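As for why the ifelse() attempt in the question looked random: Y has one row per student while X has many rows per student, and a colour vector passed outside aes() is most likely matched to the boxes by position, not by fName. Since the boxes are drawn in fName order and Y is not sorted that way, the colours end up on the wrong students. Here is a minimal sketch with two made-up students (the names and values are hypothetical) showing that mapping the condition inside aes() on merged data keeps each colour with the right student:

```r
library(ggplot2)

# Hypothetical merged data: one row per delay observation, grade attached
dt_toy <- data.frame(
  fName   = c("AAAAA", "AAAAA", "AAAAA", "BBBBB", "BBBBB", "BBBBB"),
  delay   = c(0.5, 1.2, 0.9, 2.0, 3.1, 0.4),
  overall = c(81, 81, 81, 35, 35, 35)
)

# Mapping inside aes() evaluates the condition row by row, so every
# observation (and therefore every box) gets its own student's result
p <- ggplot(dt_toy, aes(x = fName, y = delay, group = fName,
                        colour = overall >= 40)) +
  geom_boxplot() +
  scale_colour_manual(name = "overall >= 40",
                      values = c("TRUE" = "red", "FALSE" = "green")) +
  coord_flip()

print(p)
```

Here AAAAA (overall 81) should come out red and BBBBB (overall 35) green, whatever order the rows or students appear in.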


Upvotes: 1
