Running multiple linear regressions across several columns of a data frame in R

Question

I have a dataset structured as such: enter image description here

I would like to run linear regression models and ANOVA using V1, V2...etc. as the independent variables and the g column as the dependent variable in each case (i.e. lm(V1 ~ g), lm(V2 ~ g), and so forth). This would be straightforward except that these linear regressions need to be grouped by level in the pair column, such that, for example, my output contains lm(V1 ~ g) for all rows with pair 1.1 and lm(V1 ~ g) for all pairs 1.201, etc.

I've tried a number of approaches using for loops, lapply and the data.table package, and nothing gives me exactly the output I'd like. Can anyone give me any insight on the best way to tackle this problem?

Edit: My full data set has 7056 different pairs in the pair column and 100 V columns (V1...V100). My latest attempt at this problem:

df$pair <- as.factor(df$pair)
out <- list()
for (i in 3:ncol(df)){
    out[[i]] <- lapply(levels(df$pair), function(x) {
    data.frame(df=x, g = coef(summary(lm(df[,i]~ df$g, data=df[df$pair==x,])),row.names=NULL))})
    }

Dave Gruenewald · Accepted Answer

Let's get some tidyverse power here, along with broom, and forgo all these loops...

First I'll make a dummy table:

df <- data.frame(
  g = runif(50), 
  pair = sample(x = c("A", "B", "C"), size = 50, replace = TRUE), 
  V1 = runif(50), 
  V2 = runif(50), 
  V3 = runif(50), 
  V4 = runif(50), 
  V5 = runif(50),
  stringsAsFactors = FALSE
)

That's approximately what your data structure looks like. Now onto the meat of the code:

library(tidyverse)
library(broom)

df %>% 
  as_tibble %>% 
  gather(key = "column", value = "value", V1:V5) %>%       # first set the data in long format
  nest(g, value) %>%                                       # now nest the dependent and independent factors
  mutate(model = map(data, ~lm(g ~ value, data = .))) %>%  # fit the model using purrr
  mutate(tidy_model = map(model, tidy)) %>%                # clean the model output with broom
  select(-data, -model) %>%                                # remove the "untidy" parts
  unnest()                                                 # get it back in a recognizable data frame

Which gives us the following:

# A tibble: 30 x 7
   pair  column term        estimate std.error statistic  p.value
                              
 1 C     V1     (Intercept)  0.470       0.142    3.31   0.00561 
 2 C     V1     value        0.125       0.265    0.472  0.645   
 3 B     V1     (Intercept)  0.489       0.142    3.45   0.00359 
 4 B     V1     value       -0.0438      0.289   -0.151  0.882   
 5 A     V1     (Intercept)  0.515       0.111    4.63   0.000279
 6 A     V1     value       -0.00569     0.249   -0.0229 0.982   
 7 C     V2     (Intercept)  0.367       0.147    2.50   0.0265  
 8 C     V2     value        0.377       0.300    1.26   0.231   
 9 B     V2     (Intercept)  0.462       0.179    2.59   0.0206  
10 B     V2     value        0.0175      0.322    0.0545 0.957   
# … with 20 more rows

yep, that's a beaut! Note that I used lm(g ~ value) instead of lm(value ~ g) as this is what your text description alluded to.

Running multiple linear regressions across several columns of a data frame in R

Answers (2)

Related Questions