loop or apply multiple regressions, extract coefficients and p-values into data frame

Question

I have a data frame with 3 dependent (LHS) variables and 4 independent (RHS) variables. I'd like to run a linear regression of each LHS variable on each RHS varaiable and store the results of each regression as a row in the data frame with the columns: lhs, rhs, Estimate, Std. Error, t value, Pr(>|t|).

For example, using mtcars, I considered a nested loop:

lhs <- c('mpg', 'cyl', 'disp')
rhs <- c('hp', 'drat', 'wt', 'qsec')

reg_count <- 1
for (i in lhs){
  for (j in rhs){
    model <- lm(i ~ j, data = mtcars)
    results[reg_count] <- coef(summary(model))
    reg_count <- reg_count + 1
  }
}

However, this fails for a number of reasons. Is there a simple way I can do this? Ideally using an apply() function rather than a loop?

Gregor Thomas · Accepted Answer

Here's how I would do it. I shortened your example a little, but that won't matter:

lhs <- c('mpg', 'cyl', 'disp')
rhs <- c('hp', 'drat')

models = list()
for (i in lhs){
  for (j in rhs){
    models[[paste(i, "vs", j)]] <- lm(as.formula(paste(i, "~", j)), data = mtcars)
  }
}

If you want to use apply, you'll need to start with a matrix. The difference in runtime will be negligible.

# with apply:
coefs_mat = expand.grid(lhs, rhs)
mods = apply(coefs_mat, 1, function(row) {
  lm(as.formula(paste(row[1], "~", row[2])), data = mtcars)
})
names(mods) = with(coefs_mat, paste(Var1, "vs", Var2))

Both methods give the same results. Now we can pull the coefficients, etc. with broom::tidy

# get coefs
library(broom)
coefs = lapply(mods, tidy, simplify = F)
# combine
dplyr::bind_rows(coefs, .id = "mod")
#             mod        term      estimate    std.error  statistic      p.value
# 1     mpg vs hp (Intercept)   30.09886054 1.633921e+00 18.4212465 6.642736e-18
# 2     mpg vs hp          hp   -0.06822828 1.011930e-02 -6.7423885 1.787835e-07
# 3     cyl vs hp (Intercept)    3.00679525 4.254852e-01  7.0667442 7.405351e-08
# 4     cyl vs hp          hp    0.02168354 2.635142e-03  8.2286042 3.477861e-09
# 5    disp vs hp (Intercept)   20.99248341 3.260662e+01  0.6438104 5.245902e-01
# 6    disp vs hp          hp    1.42977003 2.019414e-01  7.0801224 7.142679e-08
# 7   mpg vs drat (Intercept)   -7.52461844 5.476663e+00 -1.3739423 1.796391e-01
# 8   mpg vs drat        drat    7.67823260 1.506705e+00  5.0960421 1.776240e-05

We can also pull out model summary stats:

# get summary stats
summ = lapply(mods, glance, simplify = F)
dplyr::bind_rows(summ, .id = "mod")
#            mod r.squared adj.r.squared     sigma statistic      p.value df     logLik
# 1    mpg vs hp 0.6024373     0.5891853  3.862962  45.45980 1.787835e-07  2  -87.61931
# 2    cyl vs hp 0.6929688     0.6827344  1.005944  67.70993 3.477861e-09  2  -44.56307
# 3   disp vs hp 0.6255997     0.6131197 77.089503  50.12813 7.142679e-08  2 -183.41236
# 4  mpg vs drat 0.4639952     0.4461283  4.485409  25.96964 1.776240e-05  2  -92.39996
# 5  cyl vs drat 0.4899134     0.4729105  1.296596  28.81354 8.244636e-06  2  -52.68517
# 6 disp vs drat 0.5044038     0.4878839 88.693360  30.53315 5.282022e-06  2 -187.89934
#         AIC       BIC     deviance df.residual
# 1 181.23863 185.63584    447.67431          30
# 2  95.12614  99.52335     30.35771          30
# 3 372.82473 377.22194 178283.74604          30
# 4 190.79993 195.19714    603.56673          30
# 5 111.37033 115.76754     50.43482          30
# 6 381.79868 386.19588 235995.36410          30

loop or apply multiple regressions, extract coefficients and p-values into data frame

Answers (2)

Related Questions