RStudio35

Reputation: 11

Variable not found in a function using dplyr functions

I have spent a few hours trying to write a function that returns summary statistics for quantitative variables.

Here is a small part of my data frame, called tabl_profil1, which has many quantitative variables:

  DateDiag   Age AgeDiag
     <dbl> <dbl>   <dbl>
1    1996.   43.     21.
2    2001.   53.     36.
3    2005.   75.     62.
4    1998.   62.     42.
5    2016.   53.     51.
6    2008.   65.     55.

I want to write a function that computes several statistics (mean, median, max, min, confidence interval) and returns the results in a new data frame.

I tried several approaches but always ran into problems.

function1 <- function(VarName){results <<- tabl_profil1 %>% summarise(Mean = mean(VarName))}
function1(Age)

The error is:

Error in summarise_impl(.data, dots) : 
  Evaluation error: object 'Age' not found. 

I also tried tabl_profil1[[VarName]] inside the function, but that doesn't work either.

I hope you can help. Thanks in advance, Pierre

Upvotes: 0

Views: 743

Answers (1)

camille

Reputation: 16871

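This is a non-standard evaluation problem. Inside summarise, VarName is not a column, so R evaluates the argument you passed (Age) as an ordinary object outside the data frame, and no such object exists. If you want to pass bare column names, as is typical in dplyr functions, use enquo to capture the argument as a quosure, then put !! in front of that variable wherever you use it. Try this: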

function1 <- function(VarName){
  var <- enquo(VarName)
  results <<- tabl_profil1 %>% summarise(Mean = mean(!!var))
}
function1(Age)
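
For readers on newer package versions, the enquo and !! pair can be written in one step with the embrace operator {{ }}. This is a minimal sketch, assuming rlang 0.4.0 or later and a dplyr release that supports it. The name function1_embrace is made up for illustration, and the data frame only reproduces the six rows shown in the question.

```r
library(dplyr)

# The six rows shown in the question, standing in for the full tabl_profil1
tabl_profil1 <- data.frame(
  DateDiag = c(1996, 2001, 2005, 1998, 2016, 2008),
  Age      = c(43, 53, 75, 62, 53, 65),
  AgeDiag  = c(21, 36, 62, 42, 51, 55)
)

# Hypothetical name. {{ VarName }} captures the bare column name and
# unquotes it in a single step, replacing enquo() followed by !!
function1_embrace <- function(VarName) {
  tabl_profil1 %>% summarise(Mean = mean({{ VarName }}))
}

res_embrace <- function1_embrace(Age)
```

The result is the same one-row data frame as the enquo version. Which form to use is mostly a matter of which package versions you have installed.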

In response to discussion in comments: using <<- inside a function like this isn't a great idea for a few reasons. First, it means that you're defining a function that only acts on a specific data frame, in this case tabl_profil1, and returns results to only a specific variable, in this case by assigning back to results. This pretty much defeats the purpose of writing a function, which is to flexibly repeat an operation.

<<- used this way also isn't very safe: you end up with a value stored in results without being able to see in your code where it came from. It's better to call a function and assign its output to a variable explicitly, so the assignment is visible at the point where you make the call.

Also, the advantage of the dplyr model is that you can operate on a data frame in a function and pipe that output along to the next function. You lose this by not having a data frame as the first argument.

A better way to structure this function would be like:

function1 <- function(df, VarName){
  var <- enquo(VarName)
  df %>% summarise(Mean = mean(!!var))
}

Now this function operates on any data frame you pass it and returns a one-row data frame containing the mean of whichever column you give as the second argument. You can then call something like:

mean_age <- function1(tabl_profil1, Age)
mean_height_from_other_tbl <- function1(other_table, Height)

This works on multiple data frames, and returns output that can be stored to whatever variable you want. Obviously I made up the second call as illustration.
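
The same pattern extends to the full set of statistics asked for in the question. The following is only a sketch: describe_var and its level argument are hypothetical names, and the interval is a t-based confidence interval for the mean, which is an assumption because the question does not say which kind of interval is wanted.

```r
library(dplyr)

# The six rows shown in the question, standing in for the full tabl_profil1
tabl_profil1 <- data.frame(
  DateDiag = c(1996, 2001, 2005, 1998, 2016, 2008),
  Age      = c(43, 53, 75, 62, 53, 65),
  AgeDiag  = c(21, 36, 62, 42, 51, 55)
)

# Hypothetical helper: one-row summary of a numeric column.
# Assumption: the confidence interval is a t-based interval for the mean.
describe_var <- function(df, VarName, level = 0.95) {
  var <- enquo(VarName)
  df %>%
    summarise(
      N      = sum(!is.na(!!var)),
      Mean   = mean(!!var, na.rm = TRUE),
      Median = median(!!var, na.rm = TRUE),
      Min    = min(!!var, na.rm = TRUE),
      Max    = max(!!var, na.rm = TRUE),
      SD     = sd(!!var, na.rm = TRUE)
    ) %>%
    mutate(
      # Half-width of the interval: t quantile times the standard error
      CI_low  = Mean - qt(1 - (1 - level) / 2, df = N - 1) * SD / sqrt(N),
      CI_high = Mean + qt(1 - (1 - level) / 2, df = N - 1) * SD / sqrt(N)
    )
}

# Assign the output explicitly instead of using <<-
stats_age <- describe_var(tabl_profil1, Age)

# Because the data frame is the first argument, the helper also works in a pipe
stats_recent <- tabl_profil1 %>%
  filter(DateDiag >= 2000) %>%
  describe_var(AgeDiag)
```

Each call returns a one-row data frame, so results for several variables can be stacked with bind_rows if a single table is wanted.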

Upvotes: 1
