Reputation: 1826
UPDATE July 2020:
dplyr 1.0 has changed pretty much everything about this question as well as all of the answers. See the dplyr programming vignette here:
https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html
The new way to refer to columns when their identifier is stored as a character vector is to use the .data pronoun from rlang, and then subset as you would in base R.
library(dplyr)
key <- "v3"
val <- "v2"
drp <- "v1"
df <- tibble(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))
df %>%
select(-matches(drp)) %>%
group_by(.data[[key]]) %>%
summarise(total = sum(.data[[val]], na.rm = TRUE))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 2
#> v3 total
#> <chr> <int>
#> 1 A 21
#> 2 B 19
If your code is in a package function, you can @importFrom rlang .data to avoid R check notes about undefined globals.
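As a rough sketch of how this might look inside a package function: the name sum_by() and its arguments are hypothetical and chosen only for illustration; the .data pronoun, group_by(), and summarise() are the pieces described above.

```r
library(dplyr)

#' Sum one column within the groups of another (hypothetical helper).
#' `key` and `val` are column names given as character strings.
#' @importFrom rlang .data
sum_by <- function(df, key, val) {
  df %>%
    group_by(.data[[key]]) %>%
    summarise(total = sum(.data[[val]], na.rm = TRUE), .groups = "drop")
}

df <- tibble(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))
res <- sum_by(df, key = "v3", val = "v2")
```

Passing .groups = "drop" is one way to silence the ungrouping message shown in the output above; it assumes dplyr 1.0 or later.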
ORIGINAL QUESTION:
I want to refer to an unknown column name inside a summarise. The standard evaluation functions introduced in dplyr 0.3 allow column names to be referenced using variables, but this doesn't appear to work when you call a base R function within e.g. a summarise.
library(dplyr)
key <- "v3"
val <- "v2"
drp <- "v1"
df <- data_frame(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))
The df looks like this:
> df
Source: local data frame [5 x 3]
v1 v2 v3
1 1 6 A
2 2 7 A
3 3 8 A
4 4 9 B
5 5 10 B
I want to drop v1, group by v3, and sum v2 for each group:
df %>% select(-matches(drp)) %>% group_by_(key) %>% summarise_(sum(val, na.rm = TRUE))
Error in sum(val, na.rm = TRUE) : invalid 'type' (character) of argument
The NSE version of select() works fine, since it can match a character string. The SE version of group_by() works fine, since it can now accept variables as arguments and evaluate them. However, I haven't found a way to achieve similar results when using base R functions inside dplyr functions.
Things that don't work:
df %>% group_by_(key) %>% summarise_(sum(get(val), na.rm = TRUE))
Error in get(val) : object 'v2' not found
df %>% group_by_(key) %>% summarise_(sum(eval(as.symbol(val)), na.rm = TRUE))
Error in eval(expr, envir, enclos) : object 'v2' not found
I've checked out several related questions, but none of the proposed solutions have worked for me so far.
Upvotes: 67
Views: 32707
Reputation: 6278
With the release of the rlang package and the 0.7.0 update to dplyr, this is now fairly simple.
When you want to use a character string (e.g. "v1") as a variable name, you just:
1. Convert the string to a symbol using sym() from the rlang package
2. Write !! in front of the symbol in your function call
For instance, you'd do the following:
my_var <- "Sepal.Length"
my_sym <- sym(my_var)
summarize(iris, Mean = mean(!!my_sym))
More compactly, you could combine the step of converting your string to a symbol with sym() and prefixing it with !! when writing your function call.
For instance, you could write:
my_var <- "Sepal.Length"
summarize(iris, mean(!!sym(my_var)))
To return to your original example, you could do the following:
library(dplyr)
library(rlang)
key <- "v3"
val <- "v2"
drp <- "v1"
df <- data_frame(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))
df %>%
# NOTE: we don't have to do anything to `drp`
# since the matches() function expects a character string
select(-matches(drp)) %>%
group_by(!!sym(key)) %>%
summarise(sum(!!sym(val), na.rm = TRUE))
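If several column names are stored as strings, the same idea should extend to syms() with the !!! splice operator, both from rlang. A minimal sketch, assuming the built-in mtcars data and the hypothetical variables keys and val:

```r
library(dplyr)
library(rlang)

keys <- c("cyl", "gear")  # grouping columns, stored as strings
val  <- "mpg"             # column to sum, stored as a string

res <- mtcars %>%
  # syms() turns each string into a symbol; !!! splices the list
  # into group_by() as separate arguments
  group_by(!!!syms(keys)) %>%
  summarise(total = sum(!!sym(val), na.rm = TRUE), .groups = "drop")
```

The result has one row per distinct combination of cyl and gear.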
Alternative Syntax
With the release of rlang version 0.4.0, you can use the following syntax:
my_var <- "Sepal.Length"
my_sym <- sym(my_var)
summarize(iris, Mean = mean({{ my_sym }}))
Instead of writing !!my_sym, you can write {{ my_sym }}. This has the advantage of being arguably clearer, but has the disadvantage that you have to convert the string to a symbol before placing it inside the brackets. For instance, you can write !!sym(my_var) but you can't write {{sym(my_var)}}.
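For context, {{ }} is mainly intended for forwarding a function argument that the caller supplies as a bare column name. A minimal sketch, where mean_of() is a hypothetical wrapper and not part of dplyr or rlang:

```r
library(dplyr)

# Hypothetical wrapper: `var` is passed unquoted by the caller and
# forwarded into summarise() with the embrace operator {{ }}.
mean_of <- function(data, var) {
  summarise(data, Mean = mean({{ var }}, na.rm = TRUE))
}

res <- mean_of(iris, Sepal.Length)
```

When the column name arrives as a string rather than a bare name, the !!sym() form shown earlier still applies.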
Additional details
Of all the official documentation explaining how the usage of sym() and !! works, these seem to be the most accessible:
Upvotes: 42
Reputation: 67778
Please note that this answer does not apply to dplyr >= 0.7.0, but to previous versions.
dplyr 0.7.0 has a new approach to non-standard evaluation (NSE) called tidyeval. It is described in detail in vignette("programming").
The dplyr vignette on non-standard evaluation is helpful here. Check the section "Mixing constants and variables" and you find that the function interp from package lazyeval could be used, and "[u]se as.name if you have a character string that gives a variable name":
library(lazyeval)
df %>%
select(-matches(drp)) %>%
group_by_(key) %>%
summarise_(sum_val = interp(~sum(var, na.rm = TRUE), var = as.name(val)))
# v3 sum_val
# 1 A 21
# 2 B 19
Upvotes: 55
Reputation: 28441
New dplyr update:
The new functionality of dplyr can help with this. Instead of strings for the variables that need non-standard evaluation, we use quosures created with quo(). We undo the quoting with the !! operator. For more on these see this vignette. You will need the developer's version of dplyr until the full release.
library(dplyr) #0.5.0.9004+
key <- quo(v3)
val <- quo(v2)
drp <- "v1"
df <- data_frame(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))
df %>% select(-matches("v1")) %>%
group_by(!!key) %>%
summarise(sum(!!val, na.rm = TRUE))
# # A tibble: 2 × 2
# v3 `sum(v2, na.rm = TRUE)`
# <chr> <int>
# 1 A 21
# 2 B 19
Upvotes: 9
Reputation: 269421
Pass the .dots argument a list of strings, constructing the strings using paste, sprintf, or string interpolation from package gsubfn via fn$list in place of list, as we do here:
library(gsubfn)
df %>%
group_by_(key) %>%
summarise_(.dots = fn$list(mean = "mean($val)", sd = "sd($val)"))
giving:
Source: local data frame [2 x 3]
v3 mean sd
1 A 7.0 1.0000000
2 B 9.5 0.7071068
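The same list of strings can also be built with sprintf from base R instead of gsubfn. A minimal sketch of just the string construction, reusing val from the question; the name exprs is hypothetical, and the resulting list is what would be passed as .dots:

```r
val <- "v2"

# Build the expression strings; each %s is replaced by the column name.
exprs <- list(
  mean = sprintf("mean(%s)", val),
  sd   = sprintf("sd(%s)", val)
)
```

This only constructs the strings; it does not depend on any package being loaded.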
Upvotes: 9