How to Run Several Multivariate Regression Models on Nested Dataframes?

Question

EDIT: I've resorted to using the mtcars dataset instead, as it better represents the structure of my real data. Disregard the iris example below:

I'm looking to run several regression models with multiple dependent and independent variables.

Say I'm working on the iris dataset.

My dependent variables would be Petal.Length and Petal.Width. My independent variables would be Sepal.Length and Sepal.Width.

I would like to run the following models for each species in the dataset separately (setosa, versicolor, virginica):

lm(Petal.Length ~ Sepal.Length)
lm(Petal.Length ~ Sepal.Width)
lm(Petal.Width ~ Sepal.Length)
lm(Petal.Width ~ Sepal.Width)

My real dataset has many more dependent and independent variables, which is why I'd like to write code that is less repetitious. Here is my attempt so far, where I've nested the Species column with Tidyr, and mutated a new column containing the regression data for each species:

library(dplyr)
library(tidyr)
library(purrr)

iris %>%
  nest(data = !Species) %>%
  mutate(
    fit = map(data, ~ lm (cbind(Petal.Length, Petal.Width) ~ ., data = .x)) 
  )

Which gives the equivalent of this for each species:

lm(Petal.Length ~ Sepal.Length + Sepal.Width)
lm(Petal.Width ~ Sepal.Length + Sepal.Width)

Let's use the mtcars dataset instead:

                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
...

My dependent variables would be disp and hp. My independent variables would be drat, wt, and qsec.

I would like to run the following models for each vs and am type in the dataset separately:

lm(disp ~ drat)
lm(disp ~ wt)
lm(disp ~ qsec)
lm(hp ~ drat)
lm(hp ~ wt)
lm(hp ~ qsec)

My real dataset has many more dependent and independent variables, which is why I'd like to write code that is less repetitious. Here is my attempt so far, where I've nested the vs and am columns with Tidyr, and mutated a new column containing the regression data for each vs and am :

library(dplyr)
library(tidyr)
library(purrr)

mtcars %>%
  group_by(vs,am) %>%
  nest() %>%
  mutate(
    fit = map(data, ~ lm (cbind(disp, hp) ~ drat + wt + qsec, data = .x)) 
  )

# A tibble: 4 x 4
# Groups:   vs, am [4]
     vs    am data              fit   
                
1     0     1    
2     1     1    
3     1     0    
4     0     0

Which gives the equivalent of the following formulas for each combination of vs and am:

lm(disp ~ drat + wt + qsec)
lm(hp ~ drat + wt + qsec)

But this isn't what I want. The above code gives two multiple regressions per vs and am combination, when I want six simple regressions. I have no idea how to produce this in R.

Onyambu · Accepted Answer

This is modelling, and using the correct formula after reshaping the data, you can get the desired results without the loops.

new_data <- pivot_longer(iris, c(Sepal.Length, Sepal.Width), 
                                names_to = 'Sepal', names_prefix = 'Sepal')

lm(cbind(Petal.Width, Petal.Length) ~ value/(Species:Sepal) - value + 
                    Sepal/Species - Sepal + 0, new_data)

Call:
lm(formula = cbind(Petal.Width, Petal.Length) ~ value/(Species:Sepal) - 
                value + Sepal/Species - Sepal + 0, data = new_data)

Coefficients:
                                      Petal.Width  Petal.Length
Speciessetosa:Sepal.Length            -0.17022      0.80305    
Speciesversicolor:Sepal.Length         0.08326      0.18512    
Speciesvirginica:Sepal.Length          1.22611      0.61047    
Speciessetosa:Sepal.Width              0.02418      1.18292    
Speciesversicolor:Sepal.Width          0.16691      1.93492    
Speciesvirginica:Sepal.Width           0.66406      3.51090    
value:Speciessetosa:Sepal.Length       0.08314      0.13163    
value:Speciesversicolor:Sepal.Length   0.20936      0.68647    
value:Speciesvirginica:Sepal.Length    0.12142      0.75008    
value:Speciessetosa:Sepal.Width        0.06471      0.08141    
value:Speciesversicolor:Sepal.Width    0.41845      0.83938    
value:Speciesvirginica:Sepal.Width     0.45795      0.68632

If you want to run the for loops:

iris %>%
  nest(data = !Species) %>%
  summarise(model = map(c('Sepal.Length', 'Sepal.Width'),
                               ~map(data,~lm(reformulate(.y, 'cbind(Petal.Width, Petal.Length)'),.x) %>%
                                               coef()%>%
                                               as_tibble(rownames = 'rn'), .y = .x)%>%
                          set_names(Species)%>%
                          bind_rows(.id='Species')))%>%
  unnest(model) 

# A tibble: 12 x 4
   Species    rn           Petal.Width Petal.Length
                               
 1 setosa     (Intercept)      -0.170        0.803 
 2 setosa     Sepal.Length      0.0831       0.132 
 3 versicolor (Intercept)       0.0833       0.185 
 4 versicolor Sepal.Length      0.209        0.686 
 5 virginica  (Intercept)       1.23         0.610 
 6 virginica  Sepal.Length      0.121        0.750 
 7 setosa     (Intercept)       0.0242       1.18  
 8 setosa     Sepal.Width       0.0647       0.0814
 9 versicolor (Intercept)       0.167        1.93  
10 versicolor Sepal.Width       0.418        0.839 
11 virginica  (Intercept)       0.664        3.51  
12 virginica  Sepal.Width       0.458        0.686

EDIT: in the case of mtcars dataset, the following will suffice:

pivot_longer(mtcars, c(drat, wt, qsec)) %>%
  unite('am_vs', c(am, vs)) %>%
  lm(cbind(disp, hp)~value/(am_vs:name)-value + name/am_vs - name + 0, .)

Running using the map:

mtcars%>%
  nest_by(vs, am)%>%
  rowwise()%>%
  summarise(vs, am, model = map(c('drat', 'wt', 'qsec'),
                 ~lm(sprintf('cbind(disp, hp)~%s',.x), data)%>%
                   coef()%>%
                   as_tibble(rownames = 'rn')))%>%
  unnest(model)

How to Run Several Multivariate Regression Models on Nested Dataframes?

Answers (1)

Related Questions