Jaken

Reputation: 543

Passing data frame columns into simple functions with NSE

Every time I think I've figured out the details of passing data frame columns into functions, I find a new situation that complicates the process.

I have a custom function in which I'm passing the data frame columns using curly brackets {{}}. This works great for calling them as part of dplyr sequences, as shown in sampfun1 below. However, if I want to use a very simple function on a single column (for example, sd(mtcars$disp)), I run into difficulties, as it does not seem possible to use the curly brackets directly on the dataframe (df${{col}} or any similar alternative I've tried).

Right now I'm getting around this by using df[[deparse(substitute(col))]], as shown in sampfun2 below. This is fine, but it is a bit clunky, especially in complex functions where multiple columns are being passed and then used in different ways. Is there a simpler way to achieve the output for sampfun2? I know I could just pass the column name as a string and go directly to df[[col]], but I'd like to avoid that since I'm using the column in other ways elsewhere in the function.

library(dplyr)

sampfun1 <- function(df, col){
  df %>% 
    mutate(xsd = sd({{col}}))
}

sampfun2 <- function(df, col){
  colStr <- deparse(substitute(col))
  dat_sd <- sd(df[[colStr]])
}

disp_sd1 <- sampfun1(mtcars, disp)
disp_sd2 <- sampfun2(mtcars, disp)

EDIT for clarification: This is a very simplified function just to display the issue of passing a column into a function and then calling just the column (rather than e.g. something through dplyr that calls first the data frame and then the function). My goal isn't to pass a large number of columns to the same function, just to simplify the syntax if I need to repeatedly call that column in different contexts. When calling a subset of the data frame using dplyr, this isn't a problem - it only arises when trying to extract the column. Here is another example to maybe better illustrate what I'm trying to do:

sampfun3 <- function(df, col){
  single_col <- df %>% select({{col}}) %>% pull()
  dat_sd <- sd(single_col)
}

This also works for what I'm trying to do, though it's a little more cumbersome than sampfun2. I was just wondering if there's a simpler way to extract a specific column when it's been passed using {{}}.

Upvotes: 4

Views: 151

Answers (3)

G. Grothendieck

Reputation: 270428

Approaches (1) and (2) rely on tidy evaluation (dplyr and rlang), but (3) and (4) seem closer to what is mentioned in the comment.

1) dplyr: If the problem is to calculate the sd of several columns, then we can pass a selection using tidy-select syntax.

library(dplyr)

sampfun3 <- function(df, sel) {
  df %>% summarize(across({{sel}}, sd))
}

sampfun3(mtcars, mpg:disp)  # columns from mpg to disp
##        mpg      cyl     disp
## 1 6.026948 1.785922 123.9387

sampfun3(mtcars, starts_with("c"))  # columns whose name starts with c
##        cyl   carb
## 1 1.785922 1.6152

sampfun3(mtcars, disp)  # just disp
##       disp
## 1 123.9387

2) rlang: If the problem is not multiple columns but rather just avoiding character strings, then this approach does not use any character strings anywhere. It requires one extra line of code for each unquoted argument passed.

library(rlang)

sampfun4 <- function(df, col) {
  col <- eval_tidy(enquo(col), df)
  sd(col) 
}

sampfun4(mtcars, disp)
## [1] 123.9387
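
To illustrate the "one extra line per unquoted argument" point, here is a minimal sketch with two columns. The name sampfun4b is hypothetical and not part of the original answer; it assumes the same eval_tidy(enquo(...), df) pattern as sampfun4.

```r
library(rlang)

# Hypothetical two-column variant of sampfun4:
# each unquoted argument gets its own eval_tidy(enquo(...)) line.
sampfun4b <- function(df, col1, col2) {
  col1 <- eval_tidy(enquo(col1), df)  # evaluate col1 with df as data mask
  col2 <- eval_tidy(enquo(col2), df)  # same for col2
  sd(col1) / mean(col2)
}

sampfun4b(mtcars, disp, cyl)
```

Because the argument is evaluated with df as a data mask, an expression such as log(disp) should also be accepted in place of a bare column name.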

3) Base R: With this approach we start and end the function body as shown, and between those two lines we can have as many lines and references to arguments as desired, with no extra per-argument code.

sampfun5 <- function(df, col1, col2) eval.parent(substitute({
  sd(df$col1) / mean(df$col2)
}))

sampfun5(mtcars, disp, cyl)
## [1] 20.0305
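
To see why this works, it can help to look at what substitute() produces before eval.parent() runs it. The helper below, show_sub, is a hypothetical name used only for illustration; it returns the substituted expression instead of evaluating it.

```r
# Hypothetical helper: same body as sampfun5, but without eval.parent(),
# so the substituted (unevaluated) expression is returned.
show_sub <- function(df, col1, col2) substitute({
  sd(df$col1) / mean(df$col2)
})

expr <- show_sub(mtcars, disp, cyl)
print(expr)  # argument names are replaced by what the caller typed

# Evaluating the expression gives the same value as the direct computation
eval(expr)
```

The arguments are replaced textually, so the expression ends up referring to mtcars, disp and cyl directly, and eval.parent() then evaluates it in the caller's environment.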

4) gtools: defmacro in gtools provides a wrapper implementing (3). See the article by Thomas Lumley starting on page 11 of https://cran.r-project.org/doc/Rnews/Rnews_2001-3.pdf

library(gtools)

sampfun6 <- defmacro(df, col1, col2, expr = {
  sd(df$col1) / mean(df$col2)
})

sampfun6(mtcars, disp, cyl)
## [1] 20.0305

Upvotes: 1

Jon Spring

Reputation: 67020

More approaches:

sampfun3 <- function(df, col) {
  df |> pull({{col}}) |> sd()
}

> sampfun3(mtcars, disp)
[1] 123.9387

sampfun4 <- function(df, col){
  df |> summarize(across( {{col}}, ~sd(.x)))
}

> sampfun4(mtcars, disp)
      disp
1 123.9387
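
Since the question mentions using the same column in several ways inside one function, here is a minimal sketch combining pull() with a regular dplyr verb. The function name sampfun_both and the new column z are hypothetical, chosen only for this example.

```r
library(dplyr)

# Hypothetical helper: the same unquoted column is used twice,
# once extracted as a plain vector and once inside mutate().
sampfun_both <- function(df, col) {
  col_sd <- df |> pull({{ col }}) |> sd()  # extract the column, then sd()
  df |> mutate(z = {{ col }} / col_sd)     # reuse the column in a dplyr verb
}

res <- sampfun_both(mtcars, disp)
head(res$z)
```

This keeps a single {{ col }} convention throughout the function, so no deparse(substitute()) step is needed.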

Upvotes: 1

jay.sf

Reputation: 73832

You could use dots and match.call.

sampfun3 <- function(df, ..., FUN=sd) {
  args <- match.call(expand.dots=FALSE)$...
  df[sapply(args, deparse)] |> sapply(FUN)
}

> sampfun3(mtcars, disp)
    disp 
123.9387 
> sampfun3(mtcars, disp, mpg, am, hp)
       disp         mpg          am          hp 
123.9386938   6.0269481   0.4989909  68.5628685 
> sampfun3(mtcars, disp, mpg, am, hp, FUN=mean)
     disp       mpg        am        hp 
230.72188  20.09062   0.40625 146.68750 

Upvotes: 1
