green diod
green diod

Reputation: 1499

Passing data-variables to R formulas

Let's say I'd like to write anscombe %>% lm_tidy("x1", "y1") (Actually, I'd like to write anscombe %>% lm_tidy(x1, y1), where x1 and y1 are part of the data frame). So, as the following function seems working:

plot_gg <- function(df, x, y) {
  x <- enquo(x)
  y <- enquo(y)
  ggplot(df, aes(x = !!x, y = !!y)) + geom_point() +
    geom_smooth(formula = y ~ x, method="lm", se = FALSE)
}

I started writing the following function:

lm_tidy_1 <- function(df, x, y) {
  x <- enquo(x)
  y <- enquo(y)
  fm <- y ~ x            ##### I tried many stuff here!
  lm(fm, data=df)
}
## Error in model.frame.default(formula = fm, data = df, drop.unused.levels = TRUE) : 
##   object is not a matrix

One comment in passing in column name as argument states that embrace {{...}} is a shorthand notation for the quote-unquote pattern. Unfortunately, error messages were different in both situations:

lm_tidy_2 <- function(df, x, y) {
  fm <- !!enquo(y) ~ !!enquo(x) # alternative: {{y}} ~ {{x}} with different errors!!
  lm(fm, data=df)
}
## Error:
## ! Quosures can only be unquoted within a quasiquotation context.

This seems working (based on @jubas's answer but we're stuck with string handling and paste):

lm_tidy_str <- function(df, x, y) {
  fm <- formula(paste({{y}}, "~", {{x}}))
  lm(fm, data=df)
}

Yet again, {{y}} != !!enquo(y). But it's worse: the following function breaks down with the same Quosure error as earlier:

lm_tidy_str_1 <- function(df, x, y) {
  x <- enquo(x)
  y <- enquo(y)
  fm <- formula(paste(!!y, "~", !!x))
  lm(fm, data=df)
}
  1. Is {{y}} != !!enquo(y)?
  2. How to pass data-variables to lm?

EDIT: Sorry, there were left-overs from my many trials. I want to directly pass the data-variables (say x1 and y1) to the function that is going to use them as formula components (such as lm) and not their string versions ("x1" and "y1"): I try to avoid strings as long as possible and it's more streamlined from the user perspective.

Upvotes: 4

Views: 1264

Answers (4)

Adrian Chira
Adrian Chira

Reputation: 77

Here's what I use:

  fm <- as.formula(paste0(y, ' ~ ', x))
  lm(fm, data=df)

See:

?as.formula

Upvotes: 1

Brian Syzdek
Brian Syzdek

Reputation: 948

Wrap the formula in "expr," then evaluate it.

library(dplyr)
lm_tidy <- function(df, x, y) {
  x <- sym(x)
  y <- sym(y)
  fm <- expr(!!y ~ !!x)
  lm(fm, data = df)
}

This function is equivalent:

lm_tidy <- function(df, x, y) {
  fm <- expr(!!sym(y) ~ !!sym(x))
  lm(fm, data = df)
}

Then

lm_tidy(mtcars, "cyl", "mpg")

gives

Call:
lm(formula = fm, data = .)

Coefficients:
(Intercept)          cyl  
     37.885       -2.876  

EDIT per comment below:

library(rlang)
lm_tidy_quo <- function(df, x, y){
    y <- enquo(y)
    x <- enquo(x)
    fm <- paste(quo_text(y), "~", quo_text(x))
    lm(fm, data = df)
}

You can then pass symbols as arguments

lm_tidy_quo(mtcars, cyl, mpg)

Upvotes: 1

Onyambu
Onyambu

Reputation: 79208

Consider:

lm_tidy_1 <- function(df, x, y) {
  fm <- reformulate(as.character(substitute(x)), substitute(y))
  lm(fm, data=df)
}

lm_tidy_1(iris, Species, Sepal.Length)
lm_tidy_1(iris, 'Species', Sepal.Length)
lm_tidy_1(iris, Species, 'Sepal.Length')
lm_tidy_1(iris, 'Species', 'Sepal.Length')

Edit:

If you need the formula to appear, change the call object:

lm_tidy_1 <- function(df, x, y) { 
   fm <- reformulate(as.character(substitute(x)), substitute(y)) 
   res<-lm(fm, data=df) 
   res$call[[2]]<- fm
   res
}

lm_tidy_1(iris, Species, Sepal.Length) 

Call:
lm(formula = Sepal.Length ~ Species, data = df)

Coefficients:
      (Intercept)  Speciesversicolor   Speciesvirginica  
            5.006              0.930              1.582  

Upvotes: 5

Gwang-Jin Kim
Gwang-Jin Kim

Reputation: 9865

@BiranSzydek's answer is pretty good. However it has 3 downsides:

Call:
lm(formula = fm, data = .)
  1. One cannot see the formula nor the data which were actually used.
  2. One has to input the symbols as strings.
  3. The dependency from rlang - though it is a great package.

You can indeed solve this problem with pure base R!

The solution in pure base R

R is actually under-the-hood a Lisp. It is suitable for such meta-programming tasks. The only downside of R is its horrible syntax. Especially when facing meta-programming, it is not as beautiful and as elegant like the Lisp languages. The syntax really can confuse a lot - as you experienced it yourself when trying to solve this problem.

The solution is to use substitute() by which you can substitute code pieces in a quoted manner:

lm_tidy <- function(df, x, y) {
  # take the arguments as code pieces instead to evaluate them:
  .x <- substitute(x)
  .y <- substitute(y)
  .df <- substitute(df)
  # take the code piece `y ~ x` and substitute using list lookup table
  .fm <- substitute(y ~ x, list(y=.y, x=.x))
  # take the code `lm(fm, data=df)` and substitute with the code pieceses defined by the lookup table
  # by replacing them by the code pieces stored in `.fm` and `.df`
  # and finally: evaluate the substituted code in the parent environment (the environment where the function was called!)
  eval.parent(substitute(lm(fm, data=df), list(fm=.fm, df=.df)))
}

The trick is to use eval.parent(substitute( <your expression>, <a list which determines the evaluation lookup-table for the variables in your expression>)).

Beware of scoping! As long as <your expression> is constructed only using variables which are defined inside the function or inside the lookup-list given to substitute(), there won't be any scoping problems! But avoid to refer to any other variables within <your expression>! - So this is the only rule you have to obey to use eval()/eval.parent() safely in this context! but even if, the eval.parent() takes care, that the substituted code is executed within the environment where this function was called.

Now, you can do:

lm_tidy(mtcars, cyl, mpg)

the output is now as desired:

Call:
lm(formula = mpg ~ cyl, data = mtcars)

Coefficients:
(Intercept)          cyl  
     37.885       -2.876  

And we did this with pure base R!

The trick for safe use of eval() is really that every variable in the substitute() expression is defined/given inside the lookup tables for substitute() or the function's argument. In other words: None of the replaced variables refers to any dangling variables outside the function definition.

plot_gg function

So following these rules, your plot_gg function would be defined as:

plot_gg <- function(df, x, y) {
  .x <- substitute(x)
  .y <- substitute(y)
  .df <- substitute(df)
  .fm <- substitute( y ~ x, list(x=.x, y=.y))
  eval.parent(substitute(
    ggplot(df, aes(x=x, y=y)) + geom_point() +
      geom_smooth(formula = fm, method="lm", se=FALSE),
    list(fm=.fm, x=.x, y=.y, df=.df)
  ))
}

When you want to enter x and y as strings


lm_tidy_str <- function(df, x, y) {
  .x <- as.name(x)
  .y <- as.name(y)
  .df <- substitute(df)
  .fm <- substitute(y ~ x, list(y=.y, x=.x))
  eval.parent(substitute(lm(fm, data=df), list(fm=.fm, df=.df)))
}

plot_gg_str <- function(df, x, y) {
  .x <- as.name(x)
  .y <- as.name(y)
  .df <- substitute(df)
  .fm <- substitute( y ~ x, list(x=.x, y=.y))
  eval.parent(substitute(
    ggplot(df, aes(x=x, y=y)) + geom_point() +
      geom_smooth(formula = fm, method="lm", se=FALSE),
    list(fm=.fm, x=.x, y=.y, df=.df)
  ))
}

lm_tidy_str(mtcars, "cyl", "mpg")

# Call:
# lm(formula = mpg ~ cyl, data = mtcars)
# 
# Coefficients:
# (Intercept)          cyl  
#      37.885       -2.876  
# 

require(ggplot2)
plot_gg_str(mtcars, "cyl", "mpg")



Upvotes: 6

Related Questions