Reputation: 26570

ggplot: aes vs aes_string, or how to programmatically specify column names?

Let's assume we have the following data frame

data <- data.frame(time=1:10, y1=runif(10), y2=runif(10), y3=runif(10))

and we want to create a plot like this:

p <- ggplot(data, aes(x=time))
p <- p + geom_line(aes(y=y1, colour="y1"))
p <- p + geom_line(aes(y=y2, colour="y2"))
p <- p + geom_line(aes(y=y3, colour="y3"))
plot(p)

But what if we have many more "y" columns and do not know their exact names? This raises the question: how can we iterate over all columns programmatically and add them to the plot? Basically the goal is:

otherFeatures <- names(data)[-1]
for (f in otherFeatures) {
  # what goes here?
}

Failed Attempts

So far I have found many ways that do not work. For instance (all following examples only show the code line in the above for loop):

My first try was simply to use aes_string instead of aes in order to specify the column name by the loop variable f:

p <- p + geom_line(aes_string(y=f, colour=f))

But this does not give the same result, because colour is no longer a fixed label for each line: aes_string interprets the string in f as a column name, so colour is mapped to that column's values. As a result, the legend becomes a continuous colour bar and does not contain the column names. My next guess was to mix aes and aes_string, trying to set colour to a fixed string:

p <- p + geom_line(aes_string(y=f), aes(colour=f))

But this results in Error: ggplot2 doesn't know how to deal with data of class uneval. My next attempt was to use colour "absolutely" (not within aes) like this:

p <- p + geom_line(aes_string(y=f), colour=f)

But this gives Error: invalid color name 'y1' (and I don't want to pick some proper color names manually either). The next try was to go back to aes only, replicating the manual approach:

p <- p + geom_line(aes(y=data[[f]], colour=f))

This does not give an error, but it only plots the last column. This makes sense, since aes probably calls substitute, so the expression is evaluated lazily with whatever value f has after the loop (calling rm(f) before plot(p) gives an error, which indicates that the evaluation happens after the loop).

To rephrase the question: What kind of substitute/eval/quote magic is necessary to replicate the simple code from above within a for loop?

Upvotes: 10

Answers (3)

Michael

Reputation: 66

This is old now, but in case anyone else comes across it: I had a very similar problem that was driving me crazy. The solution I found was to pass aes_q() to geom_line() and wrap the column name in as.name(). You can find details on aes_q() here. Below is the way I would solve this problem, though the same principle should work in a loop. Note that I add multiple variables with geom_line() as a list here, which generalizes better (including to one variable).

varnames <- c("y1", "y2", "y3")
add_lines <- lapply(varnames, function(i) geom_line(aes_q(y = as.name(i), colour = i)))

p <- ggplot(data, aes(x = time))
p <- p + add_lines
plot(p)
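
For readers on a newer ggplot2 (version 3.0.0 or later is assumed here, where aes_q() and aes_string() are soft-deprecated), the same idea can be sketched directly inside the for loop from the question. This is only a sketch under that assumption: it uses the .data pronoun to look up a column by name and the !! operator to inline the current value of f when aes() is called, rather than when the plot is drawn.

```r
library(ggplot2)

# Same toy data as in the question
data <- data.frame(time = 1:10, y1 = runif(10), y2 = runif(10), y3 = runif(10))

p <- ggplot(data, aes(x = time))
for (f in names(data)[-1]) {
  # !!f inlines the current string into the captured expression, so each
  # layer keeps its own column name instead of looking up f at draw time.
  p <- p + geom_line(aes(y = .data[[!!f]], colour = !!f))
}
print(p)
```

Because colour receives a constant string per layer, the legend should list the column names, as in the manual version.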

Hope that helps!

Upvotes: 5

Nicolas De Jay

Reputation: 444

NOTE: This is not really an answer, just a very partial explanation of what is going on behind the scenes that might set you on the right track. I have to admit my understanding of NSE (non-standard evaluation) is still very basic.

I have struggled and am still struggling with this particular issue. I have narrowed down the issue to NSE. I am not familiar with R's native substitute/quote/eval stuff, so I am going to demonstrate using the lazyeval package.

library(lazyeval)

a <- lapply(c(1:9,13), function(i) lazy(i))

head(a)
# [[1]]
# <lazy>
#   expr: c(1, 2, 3, 4, 5, 6, 7, 8, 9, 13)[[10L]]
#   env:  <environment: 0x25889a00>
# 
# [[2]]
# <lazy>
#   expr: c(1, 2, 3, 4, 5, 6, 7, 8, 9, 13)[[10L]]
#   env:  <environment: 0x25889a00>
#
# ...........

lazy_eval(a[[1]])
# [1] 13

lazy_eval(a[[2]])
# [1] 13

I think this happens because lazy(i) binds to the promise of i. By the time we get to evaluating any of these i evaluations, i is whatever was LAST assigned to it -- in this case, 13. Perhaps this is due to the environment in which i is evaluated being shared over all iterations of the lapply function?
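
The same late-binding effect can be reproduced in base R without lazyeval, which may make it easier to see why the for loop in the question only ever shows the last column. The sketch below is an illustration, not a description of ggplot2 internals; fs, gs and make_getter are hypothetical names used only here.

```r
# Closures created in a for loop all see the same variable i,
# so after the loop they all report its final value.
fs <- list()
for (i in 1:3) {
  fs[[i]] <- function() i
}
fs[[1]]()
# [1] 3

# Giving each closure its own environment (and forcing the argument)
# freezes the value that was current when the closure was created.
make_getter <- function(i) {
  force(i)
  function() i
}
gs <- lapply(1:3, make_getter)
gs[[1]]()
# [1] 1
```

An expression captured by aes() behaves like the first case: the name f is stored, and its value is only looked up when the plot is built.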

I have had to resort to the same workarounds as you, through aes_string and aes_q. I found them quite unsatisfactory, as they are neither (1) fully consistent with aes behavior nor (2) particularly clean. Oh, the joys of learning NSE ;)

You can inspect the source code of the + operator and the aes functions by printing them at the console:

ggplot2:::`+.gg`
ggplot2:::aes
ggplot2:::aes_q
ggplot2:::aes_string

Upvotes: 1

blakeoft

Reputation: 2400

You could melt (thanks for reminding me of this function, rawr) all of your data into a few columns. For example, it could look like this:

library(reshape2)    
data2 <- melt(data, id = "time")
head(data2)
#    time variable       value
# 1     1       y1 0.353088575
# 2     2       y1 0.621565368
# 3     3       y1 0.696031085
# 4     4       y1 0.507112969
# 5     5       y1 0.009560710
# 6     6       y1 0.158993988
ggplot(data2, aes(x = time, y = value, color = variable)) + geom_line()
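
Since reshape2 is no longer under active development, the same reshaping can also be sketched with tidyr. This assumes the tidyr package is installed; data_long is a hypothetical name, and the column names variable and value are chosen only to mirror the melt() output above.

```r
library(ggplot2)
library(tidyr)

# Same toy data as in the question
data <- data.frame(time = 1:10, y1 = runif(10), y2 = runif(10), y3 = runif(10))

# Every column except time becomes a (variable, value) pair
data_long <- pivot_longer(data, cols = -time,
                          names_to = "variable", values_to = "value")

p <- ggplot(data_long, aes(x = time, y = value, colour = variable)) +
  geom_line()
print(p)
```

With the data in long form no loop is needed, because the column names are ordinary values that a single colour mapping can use.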

Upvotes: 3
