r.bot

Reputation: 5424

Tidy approach to regression models, ideally with dplyr

Reading the documentation for do() in dplyr, I was impressed by its ability to fit a separate regression model for each group of data. I was wondering whether the same approach could fit one model per independent variable, rather than one per group.

So far I've tried:

require(dplyr)
data(mtcars)

models <- data.frame(var = c("cyl", "hp", "wt"))

models <- models %>% do(mod = lm(mpg ~ as.name(var), data = mtcars))
Error in as.vector(x, "symbol") : 
  cannot coerce type 'closure' to vector of type 'symbol'

models <- models %>% do(mod = lm(substitute(mpg ~ i, as.name(.$var)), data = mtcars))
Error in substitute(mpg ~ i, as.name(.$var)) : 
  invalid environment specified

The desired final output would be something like:

  var slope standard_error_slope
1 cyl -2.87                 0.32
2  hp -0.07                 0.01
3  wt -5.34                 0.56

I'm aware that something similar is possible using an lapply approach, but I find the apply family largely inscrutable. Is there a dplyr solution?

Upvotes: 3

Views: 669

Answers (2)

A5C1D2H2I1M1N2O1R2T1

Reputation: 193687

This isn't pure "dplyr", but rather, "dplyr" + "tidyr" + "data.table". Still, I think it should be pretty easily readable.

library(data.table)
library(dplyr)
library(tidyr)

mtcars %>%
  gather(var, val, cyl:carb) %>%
  as.data.table %>%
  .[, as.list(summary(lm(mpg ~ val))$coefficients[2, 1:2]), by = var]
#      var    Estimate  Std. Error
#  1:  cyl -2.87579014 0.322408883
#  2: disp -0.04121512 0.004711833
#  3:   hp -0.06822828 0.010119304
#  4: drat  7.67823260 1.506705108
#  5:   wt -5.34447157 0.559101045
#  6: qsec  1.41212484 0.559210130
#  7:   vs  7.94047619 1.632370025
#  8:   am  7.24493927 1.764421632
#  9: gear  3.92333333 1.308130699
# 10: carb -2.05571870 0.568545640

If you really just wanted a few variables, start with a vector, not a data.frame.

models <- c("cyl", "hp", "wt")

mtcars %>%
  # select_() is deprecated in current dplyr; all_of() selects columns named in a character vector
  select(all_of(c("mpg", models))) %>%
  gather(var, val, -mpg) %>%
  as.data.table %>%
  .[, as.list(summary(lm(mpg ~ val))$coefficients[2, 1:2]), by = var]
#    var    Estimate Std. Error
# 1: cyl -2.87579014  0.3224089
# 2:  hp -0.06822828  0.0101193
# 3:  wt -5.34447157  0.5591010
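
If data.table isn't wanted at all, the same reshape-then-fit idea can probably be kept within dplyr and tidyr. This is only a sketch, assuming dplyr >= 1.0 and tidyr >= 1.0 (for all_of(), pivot_longer(), and data-frame results inside summarise(), none of which appear in the code above). The helper slope_stats() is a hypothetical name introduced here for illustration; the output column names are taken from the question.

```r
library(dplyr)
library(tidyr)

models <- c("cyl", "hp", "wt")

# Hypothetical helper: fit y ~ x and return the slope and its standard error
# as a one-row data frame, so that summarise() can unpack it into columns.
slope_stats <- function(y, x) {
  est <- coef(summary(lm(y ~ x)))[2, ]  # row 2 is the slope, row 1 the intercept
  data.frame(slope = est[["Estimate"]],
             standard_error_slope = est[["Std. Error"]])
}

result <- mtcars %>%
  select(all_of(c("mpg", models))) %>%                              # response plus chosen predictors
  pivot_longer(-mpg, names_to = "var", values_to = "val") %>%       # one row per (car, predictor)
  group_by(var) %>%
  summarise(slope_stats(mpg, val))                                  # one model per predictor

result
```

The design is the same as above: stacking the predictors into a long table turns "one model per variable" into "one model per group", which is the case dplyr handles natively.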

Upvotes: 4

Hong Ooi

Reputation: 57697

There's nothing particularly complicated about the approach in the linked page. The use of substitute and as.name is a bit arcane, but that's easily rectified.

varlist <- names(mtcars)[-1]
models <- lapply(varlist, function(x) {
    form <- formula(paste("mpg ~", x))
    lm(form, data=mtcars)
})
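
For what it's worth, the substitute route from the question can also be made to work. The "invalid environment specified" error arises because the second argument of substitute() has to be an environment or a list of bindings, not a bare symbol. A minimal base R sketch, where v, f1, f2 and fit are illustrative names that are not part of the original code:

```r
v <- "wt"

# substitute() with a named list: replace the placeholder i by the symbol wt,
# then eval() the resulting call to obtain an actual formula object.
f1 <- eval(substitute(mpg ~ i, list(i = as.name(v))))

# reformulate() builds the same formula directly from the variable name.
f2 <- reformulate(v, response = "mpg")

fit <- lm(f1, data = mtcars)
coef(fit)
```

Both f1 and f2 represent mpg ~ wt; pasting a string as above, or using reformulate(), is usually easier to read than the substitute version.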

dplyr is not the be-all and end-all of R programming. I'd suggest getting familiar with the *apply functions as they'll be of use in many situations where dplyr doesn't work.
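
As an illustration, the table asked for in the question can be assembled with lapply and rbind alone. This is a sketch under the assumption that only the three variables from the question are wanted; the column names are taken from the desired output, and result is just an illustrative name.

```r
varlist <- c("cyl", "hp", "wt")

# Fit one model per variable and keep only the slope row of the coefficient table.
rows <- lapply(varlist, function(x) {
    fit <- lm(formula(paste("mpg ~", x)), data = mtcars)
    est <- coef(summary(fit))[2, ]   # row 1 is the intercept, row 2 the slope
    data.frame(var = x,
               slope = est[["Estimate"]],
               standard_error_slope = est[["Std. Error"]])
})

# Stack the one-row data frames into a single table.
result <- do.call(rbind, rows)
print(result, digits = 2)
```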

Upvotes: 7
