user3450277

Reputation: 436

Passing a variable for a column name?

For example, suppose that you had a function that applied some dplyr functions, but you couldn't expect the datasets passed to this function to have the same column names.

For a simplified example of what I mean, say you had a data frame, arizona.trees:

arizona.trees
group arizona.redwoods   arizona.oaks 
A     23                 11        
A     24                 12  
B     9                  8 
B     10                 7
C     88                 22

and another very similar data frame, california.trees:

california.trees
group    california.redwoods california.oaks 
A        25                  50        
A        11                  33  
B        90                  5 
B        77                  3
C        90                  35

And you wanted to implement a function that returns the mean for the given groups (A, B, ... Z) for a given type of tree that would work for both of these data frames.

foo <- function(dataset, group1, group2, tree.type) {
     column.name <- colnames(dataset[2])
     result <- filter(dataset, group %in% c(group1, group2)) %>%
               select(group, contains(tree.type)) %>%
               group_by(group) %>%
               summarize("mean" = mean(column.name))
     return(result)
}

A desired output for a call of foo(california.trees, A, B, redwoods) would be:

result
       mean
A       18
B       83.5

For some reason, an implementation like foo() above doesn't work. The function seems to try to take the mean of the string stored in column.name, rather than retrieving that column and passing it to mean(), and I'm not sure how to avoid this. Part of the problem may be that the modified data frame is passed implicitly through the pipe operator, so it can't be referenced directly.

Why is this? Is there some alternative implementation that would work?

Upvotes: 3

Views: 2309

Answers (1)

akrun

Reputation: 887991

We can use the quosure-based solution from the devel version of dplyr (soon to be released as 0.6.0):

foo <- function(dataset, group1, group2, tree.type){
        group1 <- quo_name(enquo(group1))
        group2 <- quo_name(enquo(group2))
        colN <- rlang::parse_quosure(names(dataset)[2])
        tree.type <- quo_name(enquo(tree.type))
        dataset %>%
                filter(group %in% c(group1, group2)) %>%
                select(group, contains(tree.type)) %>%
                group_by(group) %>%
                summarise(mean = mean(UQ(colN)))
}


foo(california.trees, A, B, redwoods)
# A tibble: 2 × 2
#  group  mean
#  <chr> <dbl>
#1     A  18.0
#2     B  83.5

foo(arizona.trees, A, B, redwoods)
# A tibble: 2 × 2
#   group  mean
#  <chr> <dbl>
#1     A  23.5
#2     B   9.5

enquo takes each input argument and converts it to a quosure; quo_name then converts that quosure to a string so it can be used with %in% (and with contains). The name of the second column is converted from a string to a quosure with parse_quosure, and that quosure is then unquoted (UQ or !!) so that it is evaluated within summarise.
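As a rough illustration of those steps in isolation, here is a minimal sketch assuming a dplyr release with tidy evaluation and rlang attached. The helper arg_to_string and the toy data frame df are hypothetical names that are not part of the answer, and sym() is swapped in for parse_quosure() as one way to turn a column-name string into something that can be unquoted.

```r
library(dplyr)
library(rlang)

# Hypothetical helper: capture a bare (unquoted) argument and
# return its name as a string, as foo() does for group1, group2, tree.type.
arg_to_string <- function(x) {
  quo_name(enquo(x))
}

arg_to_string(redwoods)
# "redwoods"

# Toy data (an assumption, not the OP's data).
df <- data.frame(group = c("A", "A", "B"), value = c(1, 3, 10))

# Turn the second column's name (a string) into a symbol ...
col <- sym(names(df)[2])

# ... and unquote it with !! so summarise() sees the column, not the string.
result <- df %>%
  group_by(group) %>%
  summarise(mean = mean(!!col))
```

Without the unquoting step, mean() would receive the character string itself, which is the behaviour described in the question.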

NOTE: The solution above selects the column by position (the second column, as in the OP's code), so it may not work for other columns. Instead, we can match 'tree.type' and get the 'mean' of the matching column:

foo1 <- function(dataset, group1, group2, tree.type){
        group1 <- quo_name(enquo(group1))
        group2 <- quo_name(enquo(group2))
        tree.type <- quo_name(enquo(tree.type))
        dataset %>%
                filter(group %in% c(group1, group2)) %>%
                select(group, contains(tree.type)) %>%
                group_by(group) %>%
                summarise_at(vars(contains(tree.type)), funs(mean = mean(.)))
}

The function can be tested on different columns in the two datasets:

foo1(arizona.trees, A, B, oaks)
# A tibble: 2 × 2
#  group  mean
#   <chr> <dbl>
#1     A  11.5
#2     B   7.5

foo1(arizona.trees, A, B, redwood)
# A tibble: 2 × 2
#  group  mean
#   <chr> <dbl>
#1     A  23.5
#2     B   9.5

foo1(california.trees, A, B, redwood)
# A tibble: 2 × 2
#  group  mean
#   <chr> <dbl>
#1     A  18.0
#2     B  83.5

foo1(california.trees, A, B, oaks)
# A tibble: 2 × 2
#  group  mean
#  <chr> <dbl>
#1     A  41.5
#2     B   4.0

data

arizona.trees <- structure(list(group = c("A", "A", "B", "B", "C"),
    arizona.redwoods = c(23L, 24L, 9L, 10L, 88L),
    arizona.oaks = c(11L, 12L, 8L, 7L, 22L)),
    .Names = c("group", "arizona.redwoods", "arizona.oaks"),
    class = "data.frame", row.names = c(NA, -5L))

california.trees <- structure(list(group = c("A", "A", "B", "B", "C"),
    california.redwoods = c(25L, 11L, 90L, 77L, 90L),
    california.oaks = c(50L, 33L, 5L, 3L, 35L)),
    .Names = c("group", "california.redwoods", "california.oaks"),
    class = "data.frame", row.names = c(NA, -5L))
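For readers on a more recent dplyr, the same idea might be sketched with across(). This is not the answer's code: it assumes dplyr 1.0.0 or later, the name foo2 is hypothetical, the groups and tree type are passed as ordinary strings rather than bare names, and exactly one column is assumed to match tree.type (with several matches, the fixed output name "mean" would collide).

```r
library(dplyr)

# Same values as california.trees in the data section, rebuilt so the
# block runs on its own.
california.trees <- data.frame(
  group = c("A", "A", "B", "B", "C"),
  california.redwoods = c(25L, 11L, 90L, 77L, 90L),
  california.oaks = c(50L, 33L, 5L, 3L, 35L)
)

# Hypothetical variant of foo1: 'groups' is a character vector and
# 'tree.type' a string, so no quoting or unquoting is needed.
foo2 <- function(dataset, groups, tree.type) {
  dataset %>%
    filter(group %in% groups) %>%
    group_by(group) %>%
    # Average the single column whose name contains tree.type
    # and call the result "mean".
    summarise(across(contains(tree.type), mean, .names = "mean"))
}

res <- foo2(california.trees, c("A", "B"), "redwoods")
# res$mean is 18.0 for group A and 83.5 for group B
```

Passing strings avoids the enquo()/quo_name() round trip, at the cost of the bare-name call style foo(california.trees, A, B, redwoods) that the question asked for.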

Upvotes: 4
