Sebastian Zeki

Reputation: 6874

How to use standard evaluation in dplyr summarise_

I have looked in several places but I just can't figure out how to do this. The recommended approach seems to have changed a few times, which makes it even more confusing.

I want to summarise NumbOfBx by Endoscopist inside a function. I have the following data frame:

vv <- structure(list(Endoscopist = c("John Boy ", "Jupi Ter ", "Jupi Ter ", 
"John Boy ", "John Boy ", "John Boy ", "Mar Gret ", "John Boy ", 
"Mar Gret ", "Phil Ip ", "Phil Ip "), NumbOfBx = c(2, 4, NA, 
2, 12, 12, NA, NA, NA, 3, NA)), row.names = 100:110, .Names = c("Endoscopist", 
"NumbOfBx"), class = "data.frame")

My function is:

NumBx <- function(x, y, z) {
  x <- data.frame(x)
  x <- x[!is.na(x[,y]), ]
  NumBxPlot <- x %>% group_by_(z) %>% summarise(avg = mean(y, na.rm = T))
}

which I call with:

NumBx(vv, "Endoscopist", "NumbOfBx")

This gives me the following warnings:

Warning messages:
1: In mean.default(y, na.rm = T) :
  argument is not numeric or logical: returning NA
2: In mean.default(y, na.rm = T) :
  argument is not numeric or logical: returning NA
3: In mean.default(y, na.rm = T) :
  argument is not numeric or logical: returning NA

I changed the function to use summarise_ but got the same result. I then realised that summarise_, unlike group_by_, needs its expression constructed explicitly for standard evaluation, so I tried this (based on this Stack Overflow example):

library(lazyeval)
NumBx <- function(x, y, z) {
  x <- data.frame(x)
  x <- x[!is.na(x[,y]), ]
  NumBxPlot <- x %>% group_by_(z) %>% 
      summarise_(sum_val = interp(~mean(y, na.rm = TRUE), var = as.name(y)))
}

but I still get the same warnings:

Warning messages:
1: In mean.default(y, na.rm = T) :
  argument is not numeric or logical: returning NA
2: In mean.default(y, na.rm = T) :
  argument is not numeric or logical: returning NA
3: In mean.default(y, na.rm = T) :
  argument is not numeric or logical: returning NA

My intended output is:

Endoscopist   Avg
Jupi Ter       4
John Boy       28
Phil Ip        3

Upvotes: 3

Views: 360

Answers (2)

alistaire

Reputation: 43334

Using rlang (the replacement for lazyeval), you could do:

library(dplyr)

vv <- structure(list(Endoscopist = c("John Boy ", "Jupi Ter ", "Jupi Ter ", "John Boy ", "John Boy ", "John Boy ", "Mar Gret ", "John Boy ", "Mar Gret ", "Phil Ip ", "Phil Ip "), 
                     NumbOfBx = c(2, 4, NA, 2, 12, 12, NA, NA, NA, 3, NA)), 
                row.names = 100:110, .Names = c("Endoscopist", "NumbOfBx"), class = "data.frame")

num_bx <- function(.data, group, variable) {
    group <- enquo(group)
    variable <- enquo(variable)

    .data %>% 
        tidyr::drop_na(!!variable) %>% 
        group_by(!!group) %>% 
        summarise(avg = mean(!!variable))
}

vv %>% num_bx(Endoscopist, NumbOfBx)
#> # A tibble: 3 x 2
#>   Endoscopist   avg
#>         <chr> <dbl>
#> 1   John Boy      7
#> 2   Jupi Ter      4
#> 3    Phil Ip      3

Or, if you want to pass the column names as strings instead of unquoted names:

num_bx <- function(.data, group, variable) {
    group <- rlang::sym(group)
    variable <- rlang::sym(variable)

    .data %>% 
        tidyr::drop_na(!!variable) %>% 
        group_by(!!group) %>% 
        summarise(avg = mean(!!variable))
}

vv %>% num_bx("Endoscopist", "NumbOfBx")
#> # A tibble: 3 x 2
#>   Endoscopist   avg
#>         <chr> <dbl>
#> 1   John Boy      7
#> 2   Jupi Ter      4
#> 3    Phil Ip      3
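
For the string-based call, the same result can be sketched without converting the strings to symbols, by indexing the `.data` pronoun and selecting the grouping column with `across(all_of())`. This is only a sketch under assumptions: it needs dplyr 1.0.0 or later (for `across()` and the `.groups` argument), and `num_bx_str` is a hypothetical name that does not come from the answer above.

```r
library(dplyr)

# Same data as in the question, rebuilt here so the example is self-contained
vv <- data.frame(
  Endoscopist = c("John Boy ", "Jupi Ter ", "Jupi Ter ", "John Boy ",
                  "John Boy ", "John Boy ", "Mar Gret ", "John Boy ",
                  "Mar Gret ", "Phil Ip ", "Phil Ip "),
  NumbOfBx = c(2, 4, NA, 2, 12, 12, NA, NA, NA, 3, NA),
  stringsAsFactors = FALSE
)

# num_bx_str is a hypothetical helper; `group` and `variable` are column names given as strings
num_bx_str <- function(data, group, variable) {
  data %>%
    filter(!is.na(.data[[variable]])) %>%   # drop rows where the summarised column is NA
    group_by(across(all_of(group))) %>%     # group by the column named in `group`
    summarise(avg = mean(.data[[variable]]), .groups = "drop")
}

res_str <- num_bx_str(vv, "Endoscopist", "NumbOfBx")
res_str
```

Because the column names stay ordinary strings, this form is convenient when they come from user input or a configuration file.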

Upvotes: 2

Artem Sokolov

Reputation: 13691

Following the dplyr programming vignette, define your function as follows:

NumBx <- function( x, y, z )
{
    yy <- enquo( y )
    zz <- enquo( z )

    data.frame(x) %>% filter( !is.na(!!yy) ) %>% group_by( !!zz ) %>%
        summarize( avg = mean(!!yy) )
}

You can now call it as:

NumBx( vv, NumbOfBx, Endoscopist )
#   Endoscopist   avg
#         <chr> <dbl>
# 1   John Boy      7
# 2   Jupi Ter      4
# 3    Phil Ip      3
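
In more recent versions of the tidyverse, the `enquo()` plus `!!` pair used above can usually be shortened with the embrace operator `{{ }}`. The following is a minimal sketch, assuming rlang 0.4.0 or later and dplyr 1.0.0 or later (for `.groups`); `num_bx_embrace` is a hypothetical name, and the argument order is changed so that the grouping column comes first.

```r
library(dplyr)

# Same data as in the question, rebuilt here so the example is self-contained
vv <- data.frame(
  Endoscopist = c("John Boy ", "Jupi Ter ", "Jupi Ter ", "John Boy ",
                  "John Boy ", "John Boy ", "Mar Gret ", "John Boy ",
                  "Mar Gret ", "Phil Ip ", "Phil Ip "),
  NumbOfBx = c(2, 4, NA, 2, 12, 12, NA, NA, NA, 3, NA),
  stringsAsFactors = FALSE
)

# num_bx_embrace is a hypothetical helper; `group` and `variable` are unquoted column names
num_bx_embrace <- function(data, group, variable) {
  data %>%
    filter(!is.na({{ variable }})) %>%   # {{ }} captures and unquotes the argument in one step
    group_by({{ group }}) %>%
    summarise(avg = mean({{ variable }}), .groups = "drop")
}

res_embrace <- num_bx_embrace(vv, Endoscopist, NumbOfBx)
res_embrace
```

The behaviour should match the `enquo()` version; the embrace form just removes the intermediate `yy` and `zz` variables.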

Some notes:

  1. The order of arguments in your call seemed reversed. You want to group by z, but you were passing NumbOfBx as the z argument.
  2. na.rm=TRUE is redundant. You are already filtering out the rows where the y variable is NA.
  3. The mean of John Boy should be 7, not 28 (the value stated in your intended output).
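
Notes 2 and 3 can be checked by hand in base R, without dplyr. The sketch below rebuilds the two columns from the question as plain vectors (the lowercase variable names are only illustrative) and shows that 28 is the sum of John Boy's non-missing values, while their mean is 7.

```r
# Columns from the question, as plain vectors (variable names are illustrative)
endoscopist <- c("John Boy ", "Jupi Ter ", "Jupi Ter ", "John Boy ",
                 "John Boy ", "John Boy ", "Mar Gret ", "John Boy ",
                 "Mar Gret ", "Phil Ip ", "Phil Ip ")
numb_of_bx <- c(2, 4, NA, 2, 12, 12, NA, NA, NA, 3, NA)

# Note 2: once NA rows are dropped, na.rm = TRUE has nothing left to remove
keep <- !is.na(numb_of_bx)
avg_by_endoscopist <- tapply(numb_of_bx[keep], endoscopist[keep], mean)
avg_by_endoscopist

# Note 3: John Boy's non-missing values are 2, 2, 12, 12
john_boy <- numb_of_bx[keep & endoscopist == "John Boy "]
sum(john_boy)   # 28, the figure in the intended output
mean(john_boy)  # 7, the actual average
```

Mar Gret does not appear in the result because all of her values are NA and are removed by the filter.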

Upvotes: 1
