Mariano C Giglio

Reputation: 135

Performing all possible linear regressions between one variable and a list of variables

I am using the following code (which was developed in a previous post) for the following task: performing all possible linear regressions between the first variable and each of the other variables, and saving the results in a new data frame.

library(broom)
library(dplyr)

# Candidate predictors: every column except the first (id)
x <- names(data[, -1])

# One formula per predictor, e.g. "tlv ~ var2"
out <- unlist(lapply(1, function(n) combn(x, 1, FUN = function(row)
  paste0("tlv ~ ", paste0(row, collapse = "+")))))

## Get the regression coefficients
tmp1 <- bind_rows(lapply(out, function(frml) {
  a <- tidy(lm(frml, data = data))
  a$frml <- frml
  return(a)
}))
reg_coeff2 <- tmp1

## Get regression results, i.e. R2, AIC, BIC
tmp2 <- bind_rows(lapply(out, function(frml) {
  a <- glance(lm(frml, data = data))
  a$frml <- frml
  return(a)
}))
reg_results2 <- tmp2
reg_results2$frml <- sub("tlv ~ ", "", reg_results2$frml)

The code works very well, but I would like to adapt it to do the following.

I have the following data frame (data):

structure(list(id = c(5309039, 5284969, 5300279, 5270289, 5259957, 
5267086, 5173196), var1 = c(0, 0, 0, 0, 0, 0, 0), var2 = c(23, 
24, 20, 32, 31, 37, 43), var3 = c(162, 154, 156, 154, 151.5, 
171, 154), var4 = c(62.8, 52.7, 64.5, 70.9, 63, 66.2, 60.3), 
    tlv = c(1049, 978, 1131, 1292, 1228, 1593, 1265), form20 = c(1674.12110392683, 
    1517.06018080512, 1666.03606715029, 1726.99450999549, 1627.94506984781, 
    1754.74878787639, 1608.54623766777), form19 = c(1062.84280028848, 
    902.364998653641, 1054.58187260355, 1116.8664734097, 1015.66220125765, 
    1145.22454880977, 995.841345244203), form18 = c(1050.91941325579, 
    891.3634649201, 1026.84722464179, 1073.58291322486, 980.997498562542, 
    1147.23019335865, 971.271632531001), form17 = c(1404.10436829839, 
    1220.98291088203, 1419.72032143583, 1517.11065788694, 1386.31581471687, 
    1477.21675910098, 1347.52393410332), form16 = c(1248.12292187059, 
    1126.73082253566, 1229.80850901466, 1265.36558733196, 1194.92548170827, 
    1321.39733067342, 1187.52592495257), form15 = c(990.132, 
    866.003, 1011.025, 1089.681, 992.59, 1031.918, 959.407), 
    form14 = c(1590.6052, 1436.4718, 1582.993, 1830.3706, 1688.692, 
    1812.3808, 1786.5202), form13 = c(1300.81321145176, 1130.23869905075, 
    1292.03253463863, 1358.23586808642, 1250.66417156907, 1388.37813595599, 
    1277.89625553694), form12 = c(1329.6, 1104.4, 1272, 1322.8, 
    1195.5, 1487.4, 1195.6)), row.names = c(NA, -7L), class = c("tbl_df", 
"tbl", "data.frame"))

I need to perform a linear regression between the variable tlv and each of the variables whose names start with the prefix "form", excluding the other variables (i.e. var1, var2, var3, ...).

Upvotes: 3

Views: 137

Answers (3)

G. Grothendieck

Reputation: 270298

We assume that the objective of fitting all the subsets is to find the best variables, so instead of proceeding like that, let us find the "best" variables directly using stepwise regression.

Since data has fewer rows than columns and var1 is all 0's, let us use the data frame data2 shown below for our example.
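As a small illustration of why a column such as var1 is left out, the sketch below uses three columns copied from the question's data frame (the names d, constant_cols and fit are introduced here for illustration only). It assumes nothing beyond base R: a column with a single distinct value is aliased with the intercept, so lm() cannot estimate its coefficient and reports NA.

```r
# Three columns copied from the question's data frame
d <- data.frame(
  tlv  = c(1049, 978, 1131, 1292, 1228, 1593, 1265),
  var1 = c(0, 0, 0, 0, 0, 0, 0),
  var2 = c(23, 24, 20, 32, 31, 37, 43)
)

# Columns with a single distinct value carry no information for lm()
constant_cols <- names(d)[vapply(d, function(col) length(unique(col)) == 1,
                                 logical(1))]

# var1 is aliased with the intercept, so its coefficient comes back NA
fit <- lm(tlv ~ var1 + var2, data = d)
coef(fit)
```

Dropping such columns before building the full model keeps the stepwise search from carrying a term that can never be estimated.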

First create the full model fm0 and then use stepwise regression, specifying the var variables as the lower bound, i.e. every model must contain them.

This runs quickly on this data and uses no packages.

data2 <- data[c("var2", "var3", "tlv", "form20", "form19")]

fm0 <- lm(tlv ~., data2)
varnames <- grep("var", names(data2), value = TRUE)
step(fm0, list(lower = reformulate(varnames)))

giving this model:

Call:
lm(formula = tlv ~ var2 + var3 + form20, data = data2)

Coefficients:
(Intercept)         var2         var3       form20  
  -2235.694       13.881        6.728        1.197  

Upvotes: 0

Parfait

Reputation: 107767

Consider the apply family to build the needed formulas for all possible combinations, then pass each one to lm iteratively. Except for the broom functions, the code below uses only base R:

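library(broom)

# All combinations of the nine "form" predictors, from size 1 up to 9
indvar_list <- lapply(1:9, function(x) combn(paste0("form", 12:20), x, simplify = FALSE))

# One formula per combination; rapply() visits every leaf character vector
formulas_list <- rapply(indvar_list, function(x) as.formula(paste("tlv ~", paste(x, collapse = "+"))))

# Coefficients; the formula is stored as text (deparse1 needs R >= 4.0)
# because a formula object cannot be a data frame column
tmp1 <- do.call(rbind, lapply(formulas_list, function(f)
   transform(tidy(lm(f, data = data)), frml = deparse1(f))
))

# Model-level statistics (R2, AIC, BIC, ...)
tmp2 <- do.call(rbind, lapply(formulas_list, function(f)
   transform(glance(lm(f, data = data)), frml = deparse1(f))
))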
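To get a feel for how many models this approach fits, and to restrict it to the one-predictor regressions the question asks for, here is a minimal base R sketch. The vector cols is a hypothetical stand-in that only mirrors the column names of the question's data frame, and form_vars, n_models and single_formulas are names introduced here for illustration.

```r
# Hypothetical column names mirroring the question's data frame
cols <- c("id", paste0("var", 1:4), "tlv", paste0("form", 20:12))

# Pick predictors by prefix instead of hard-coding form12..form20
form_vars <- grep("^form", cols, value = TRUE)

# All non-empty subsets of the predictors, grouped by subset size
indvar_list <- lapply(seq_along(form_vars), function(k)
  combn(form_vars, k, simplify = FALSE))

# Number of candidate models: 2^9 - 1
n_models <- sum(lengths(indvar_list))

# One-predictor formulas only, built with reformulate()
single_formulas <- lapply(indvar_list[[1]], reformulate, response = "tlv")
```

With only seven rows in the example data, the larger subsets leave few or no residual degrees of freedom, so limiting the search to small subset sizes may be the more practical choice.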

Upvotes: 2

akrun

Reputation: 887951

We can make it shorter with map_dfr from purrr, reusing the out vector of formula strings from the question:

library(purrr)
tmp1 <- map_dfr(set_names(out, out),  ~ lm(.x, data = data) %>% tidy, .id = 'fmla')
tmp2 <- map_dfr(set_names(out, out),  ~ lm(.x, data = data) %>% glance, .id = 'fmla')

Or, if we need only the form variables, get the names of the columns that start with "form" using startsWith, pass each one to reformulate to build the formula for lm, tidy the output, and add a "Var" column holding the column name. (If we need the formula itself, assign the reformulate output to an object and reuse it later.)

startsWith(names(data), "form") %>%
    magrittr::extract(names(data), .) %>%
    map_dfr(~  lm(reformulate(.x, 'tlv'), data = data) %>% 
                  tidy %>%
                  mutate(Var = .x))

Similarly, replace tidy with glance to get the model-level statistics (R2, AIC, BIC).
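A minimal sketch of the glance variant, which also keeps the formula as text, might look like this. It assumes broom, dplyr and purrr are installed, uses four columns copied from the question's data frame, and introduces the names form_vars, fit_stats and fmla for illustration only.

```r
library(broom)
library(dplyr)
library(purrr)

# Four columns copied from the question's data frame
data <- data.frame(
  tlv    = c(1049, 978, 1131, 1292, 1228, 1593, 1265),
  var2   = c(23, 24, 20, 32, 31, 37, 43),
  form15 = c(990.132, 866.003, 1011.025, 1089.681, 992.59, 1031.918, 959.407),
  form12 = c(1329.6, 1104.4, 1272, 1322.8, 1195.5, 1487.4, 1195.6)
)

# Keep only the columns whose names start with "form"
form_vars <- names(data)[startsWith(names(data), "form")]

# One row of model-level statistics per single-predictor regression
fit_stats <- map_dfr(form_vars, function(v) {
  fmla <- reformulate(v, response = "tlv")   # e.g. tlv ~ form15
  glance(lm(fmla, data = data)) %>%
    mutate(Var = v, frml = deparse(fmla))
})
```

Storing the result of reformulate in fmla before calling lm is what makes the formula available for the frml column.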

Upvotes: 1
