How to shorten the code in broom while running multiple regression in R?

Question

I am running multiple regressions by different groups. I want to automate things a bit more. I tried to run and save the regressions model1, model2 and model3 initially. Then i tried to shorten the code as the following:

 temp <- df %>%
      group_by(group) %>%
      do(model1, model2, model3, data = .))) %>%   
      gather(model_name, model, -group) %>%                        
      unnest()

But this did not work. Below you find the long version that works. Can someone guide me on how to make it a bit shorter?

df <- tibble(
  a = rnorm(1000),
  b = rnorm(1000),
  c = rnorm(1000),
  d = rnorm(1000),
  group =sample.int(300,size=1000,replace=TRUE)-1)
)

df$group = as.factor(df$group)

temp1 <- df %>%
  group_by(group) %>%
  do(model2 = tidy(lm(a ~ b , data = .))) %>%   
  gather(model_name, model, -group) %>%                        
  unnest() 

temp2 <- df %>%
  group_by(group) %>%
  do(model2 = tidy(lm(a ~ c , data = .))) %>%   
  gather(model_name, model, -group) %>%                        
  unnest() 

temp3 <- df %>%
  group_by(group) %>%
  do(model3 = tidy(lm(a ~ d , data = .))) %>%   
  gather(model_name, model, -group) %>%                        
  unnest()

Ben · Accepted Answer

This might work, using nest_by and map from purrr.

Instead of group_by try using nest_by (dplyr version 1.0.0) and run your model on each row of nested data. Using nest_by will create a new temporary list column data. It is similar to previously using nest and rowwise. The model needs to be in list as well here.

Using map you can conduct models for each independent variable "b", "c", and "d". Independent variable is also included as separate column (could also be a label for the particular model).

library(tidyverse)
library(purrr)
library(broom)

df %>%
  nest_by(group) %>%
  mutate(model = list(map(c("b", "c", "d"), ~
                       cbind(independent = .x, 
                             tidy(lm(formula(paste0("a ~ ", .x)), data = data)))))) %>%
  summarise(bind_rows(model))

Output

   group independent term        estimate std.error statistic p.value
                                  
 1 0     b           (Intercept)   0.0480   NaN       NaN     NaN    
 2 0     b           b             0.268    NaN       NaN     NaN    
 3 0     c           (Intercept)  -0.124    NaN       NaN     NaN    
 4 0     c           c            -0.447    NaN       NaN     NaN    
 5 0     d           (Intercept)  -0.107    NaN       NaN     NaN    
 6 0     d           d             0.377    NaN       NaN     NaN    
 7 1     b           (Intercept)   0.473      0.296     1.60    0.356
 8 1     b           b             0.383      0.261     1.47    0.380
 9 1     c           (Intercept)   0.547      0.544     1.01    0.498
10 1     c           c            -0.183      0.798    -0.229   0.857

Edit (12/19/20): In case you want to include models with multiple covariates and interaction terms, you could simply provide the formula in the string.

For example, if you want to run 3 models per group:

main effects "b" and "c" and interaction term "b*c"
main effects "c" and "d"
main effects "d"

You could do the following:

df %>%
  nest_by(group) %>%
  mutate(model = list(map(c("b + c + b*c", "c + d", "d"), ~
                       cbind(model = .x, 
                             tidy(lm(formula(paste0("a ~ ", .x)), data = data)))))) %>%
  summarise(bind_rows(model))

Output

   group model       term        estimate std.error statistic  p.value
                                   
 1 0     b + c + b*c (Intercept)   0.718      0.281    2.56     0.0835
 2 0     b + c + b*c b             0.819      0.348    2.35     0.100 
 3 0     b + c + b*c c            -0.351      0.315   -1.11     0.346 
 4 0     b + c + b*c b:c           0.0444     0.461    0.0964   0.929 
 5 0     c + d       (Intercept)   0.614      0.409    1.50     0.208 
 6 0     c + d       c            -0.271      0.439   -0.618    0.570 
 7 0     c + d       d             0.182      0.487    0.374    0.727 
 8 0     d           (Intercept)   0.605      0.383    1.58     0.175 
 9 0     d           d             0.208      0.455    0.456    0.667 
10 1     b + c + b*c (Intercept)   0.590    NaN      NaN      NaN     
# … with 2,600 more rows

Or, if you want, you can list the equations completely and separately like this:

my_models <- c(
  "a ~ b + c + b*c", 
  "a ~ c + d", 
  "a ~ d"
)

df %>%
  nest_by(group) %>%
  mutate(model = list(map(my_models, ~
                       cbind(model = .x, 
                             tidy(lm(formula(.x), data = data)))))) %>%
  summarise(bind_rows(model))

Output

   group model           term        estimate std.error statistic  p.value
                                       
 1 0     a ~ b + c + b*c (Intercept)   0.718      0.281    2.56     0.0835
 2 0     a ~ b + c + b*c b             0.819      0.348    2.35     0.100 
 3 0     a ~ b + c + b*c c            -0.351      0.315   -1.11     0.346 
 4 0     a ~ b + c + b*c b:c           0.0444     0.461    0.0964   0.929 
 5 0     a ~ c + d       (Intercept)   0.614      0.409    1.50     0.208 
 6 0     a ~ c + d       c            -0.271      0.439   -0.618    0.570 
 7 0     a ~ c + d       d             0.182      0.487    0.374    0.727 
 8 0     a ~ d           (Intercept)   0.605      0.383    1.58     0.175 
 9 0     a ~ d           d             0.208      0.455    0.456    0.667 
10 1     a ~ b + c + b*c (Intercept)   0.590    NaN      NaN      NaN     
# … with 2,600 more rows

Data

set.seed(123)

df <- tibble(
  a = rnorm(1000),
  b = rnorm(1000),
  c = rnorm(1000),
  d = rnorm(1000),
  group =sample.int(300,size=1000,replace=TRUE)-1)
)

df$group = as.factor(df$group)

How to shorten the code in broom while running multiple regression in R?

Answers (1)

Related Questions