Raul Guarini Riva

Reputation: 795

How to correctly select dependent variable and controls in lm() in R?

I am looking for a way to construct a function that will allow the user, given a data frame, to:

  1. Choose which variable will be the dependent variable;
  2. Choose a single control variable;
  3. Based on these choices, run several different regressions.

Ideally, I would like something like this:

df <- mtcars

myreg <- function(dv, control) {
  mod1 <- lm(dv ~ control + mpg, data = df)
  mod2 <- lm(dv ~ control + wt, data = df)
  mod3 <- lm(dv ~ control + wt + mpg, data = df)
  return(list("reg1" = mod1, "reg2" = mod2, "reg3" = mod3))
}

My first guess was that passing the column names as strings would work, but R throws an error telling me that the variable lengths differ. Passing the bare names without quotes doesn't work either.

I also tried a solution based on the get function. However, it does not parse the control correctly and, honestly, I don't yet understand exactly how it works.

What's the correct way of implementing these choices in a function?

Upvotes: 3

Views: 1018

Answers (2)

Ben

Reputation: 1486

What you are trying to do is to create a wrapper function for regression. Ideally you would program this so that the function can take in an arbitrary dataset and a specified response variable, control variable and other model terms, and then produce all the models of interest.

Usually we want the user to enter variable names without quotes, so we capture each argument as an unevaluated expression and convert it to a string using deparse(substitute(...)). Since there can be arbitrarily many other model terms in the regression, a reasonable input for them is a list of character vectors, one vector per model. A wrapper function will generally construct an "internal call" for each model of interest, which may differ from the standard call for that model. Consequently, you will usually also want to amend the stored call so that it looks like the standard call. So, using this syntax, you would write something like this:

myreg <- function(response, control, other.terms = NULL, data) {
  
  #Get variable names and other terms
  DATA.NAME     <- deparse(substitute(data))
  RESPONSE.NAME <- deparse(substitute(response))
  CONTROL.NAME  <- deparse(substitute(control))
  if (is.null(other.terms)) {
    OTHER <- vector(mode = 'list', length = 1) } else {
    OTHER <- other.terms }

  #Set formula and model objects
  m <- length(OTHER)
  FORMULAE <- vector(mode = 'list', length = m)
  MODELS   <- vector(mode = 'list', length = m)
  names(MODELS) <- sprintf('MODEL%s', 1:m)
  FORM     <- paste(RESPONSE.NAME, '~', CONTROL.NAME)
  
  #Fit models
  for (i in 1:m) {
    
    #Set the formula
    if (is.null(OTHER[[i]])) {
      FORMULAE[[i]] <- FORM } else {
      FORMULAE[[i]] <- paste(FORM, '+', paste(OTHER[[i]], collapse = '+')) }
    
    #Fit the model and substitute the call
    MODELS[[i]] <- lm(formula(FORMULAE[[i]]), data = data)
    CALL <- paste0('lm(formula = ', FORMULAE[[i]], ', data = ', DATA.NAME, ')')
    MODELS[[i]]$call <- parse(text = CALL)[[1]] }
  
  #Return the models
  MODELS }

You can then use the function to produce a list of multiple regression models with the specified variables. Here is an example where you produce three different models, each with the same response and control variable, but with different additional terms in the models:

(MODELS <- myreg(response    = hp, 
                 control     = cyl,
                 other.terms = list('mpg', 'wt', c('mpg', 'wt')), 
                 data        = mtcars))

$MODEL1

Call:
lm(formula = hp ~ cyl + mpg, data = mtcars)

Coefficients:
(Intercept)          cyl          mpg  
     54.067       23.979       -2.775  

$MODEL2

Call:
lm(formula = hp ~ cyl + wt, data = mtcars)

Coefficients:
(Intercept)          cyl           wt  
     -51.81        31.39         1.33  

$MODEL3

Call:
lm(formula = hp ~ cyl + mpg + wt, data = mtcars)

Coefficients:
(Intercept)          cyl          mpg           wt  
     115.66        25.03        -4.22       -12.13
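
To see the two mechanisms in isolation, here is a minimal sketch. The helper arg_name() is a hypothetical name used only for illustration and is not part of myreg() above. Note also that the call is amended here by assigning the formula into the stored call, which is a simpler stand-in for the parse(text = ...) approach used in the function:

```r
# Hypothetical helper: returns the name of whatever was passed as `x`,
# without evaluating it.
arg_name <- function(x) {
  deparse(substitute(x))
}

arg_name(hp)      # "hp"; hp is never evaluated, so no such object is needed
arg_name(mtcars)  # "mtcars"

# The captured names can be pasted into a string and converted to a formula.
form <- as.formula(paste(arg_name(hp), "~", arg_name(cyl)))
fit  <- lm(form, data = mtcars)

# The stored call refers to the symbol `form`; overwrite that argument with
# the formula itself so the printed call shows the actual model terms.
fit$call$formula <- form
fit
```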

Upvotes: 1

ralph

Reputation: 223

You should work with string substitution. The snippet below gives a simple overview of how you could adjust your function. It would also be good practice to pass the data set df to your function as an additional argument.

df <- mtcars

## these would be function inputs
dv              <- "mpg"
control         <- "cyl"

## this would form the function body 
tmpl            <- "dv ~ control" # create a template formula 
tmpl.dv         <- gsub("dv", dv, tmpl) # insert the dv
tmpl.dv.control <- gsub("control", control, tmpl.dv) # insert the control
form            <- as.formula(tmpl.dv.control) # create the formula to use in lm

## fit the model
mod <- lm(form, data = df)
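
Wrapped up as a function that takes the data set as an argument, the idea might look like the sketch below. It is only one possible arrangement: the name fit_models() is made up for illustration, the three sets of extra terms are copied from the question, and the formula is built with reformulate() from the stats package instead of gsub() on a template.

```r
# Hypothetical wrapper: `dv` and `control` are column names given as strings.
fit_models <- function(dv, control, data) {
  # One character vector of right-hand-side terms per model
  # (the extra terms mirror the three models in the question).
  rhs <- list(
    reg1 = c(control, "mpg"),
    reg2 = c(control, "wt"),
    reg3 = c(control, "wt", "mpg")
  )
  lapply(rhs, function(rhs_terms) {
    # reformulate() builds `dv ~ term1 + term2 + ...` from strings
    form <- reformulate(rhs_terms, response = dv)
    lm(form, data = data)
  })
}

models <- fit_models("hp", "cyl", data = mtcars)
names(models)  # "reg1" "reg2" "reg3"
```

Because lapply() keeps the names of the list it iterates over, the result can be indexed as models$reg1, models$reg2 and models$reg3, matching the list returned in the question.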

Upvotes: 1
