Reputation: 135
I am using the following code (developed in a previous post) for the following task: performing all possible linear regressions between the first variable and the other variables, and saving the results in a new data frame.
library(broom)
library(dplyr)
x <- names(data[,-1])
out <- unlist(lapply(1, function(n) combn(x, 1, FUN=function(row)
paste0("tlv ~ ", paste0(row, collapse = "+")))))
## get the regression coefficients
tmp1 = bind_rows(lapply(out, function(frml) {
a = tidy(lm(frml, data=data))
a$frml = frml
return(a)
}))
reg_coeff2 <- tmp1
## Get regression results i.e. R2, AIC, BIC
tmp2 = bind_rows(lapply(out, function(frml) {
a = glance(lm(frml, data=data))
a$frml = frml
return(a)
}))
reg_results2 <- tmp2
reg_results2$frml <- sub("tlv ~ ", "", reg_results2$frml)
The code works very well, but I would like to adapt it to do the following.
I have the following data frame (data):
structure(list(id = c(5309039, 5284969, 5300279, 5270289, 5259957,
5267086, 5173196), var1 = c(0, 0, 0, 0, 0, 0, 0), var2 = c(23,
24, 20, 32, 31, 37, 43), var3 = c(162, 154, 156, 154, 151.5,
171, 154), var4 = c(62.8, 52.7, 64.5, 70.9, 63, 66.2, 60.3),
tlv = c(1049, 978, 1131, 1292, 1228, 1593, 1265), form20 = c(1674.12110392683,
1517.06018080512, 1666.03606715029, 1726.99450999549, 1627.94506984781,
1754.74878787639, 1608.54623766777), form19 = c(1062.84280028848,
902.364998653641, 1054.58187260355, 1116.8664734097, 1015.66220125765,
1145.22454880977, 995.841345244203), form18 = c(1050.91941325579,
891.3634649201, 1026.84722464179, 1073.58291322486, 980.997498562542,
1147.23019335865, 971.271632531001), form17 = c(1404.10436829839,
1220.98291088203, 1419.72032143583, 1517.11065788694, 1386.31581471687,
1477.21675910098, 1347.52393410332), form16 = c(1248.12292187059,
1126.73082253566, 1229.80850901466, 1265.36558733196, 1194.92548170827,
1321.39733067342, 1187.52592495257), form15 = c(990.132,
866.003, 1011.025, 1089.681, 992.59, 1031.918, 959.407),
form14 = c(1590.6052, 1436.4718, 1582.993, 1830.3706, 1688.692,
1812.3808, 1786.5202), form13 = c(1300.81321145176, 1130.23869905075,
1292.03253463863, 1358.23586808642, 1250.66417156907, 1388.37813595599,
1277.89625553694), form12 = c(1329.6, 1104.4, 1272, 1322.8,
1195.5, 1487.4, 1195.6)), row.names = c(NA, -7L), class = c("tbl_df",
"tbl", "data.frame"))
and I need to perform a linear regression between the variable tlv and each variable whose name starts with the prefix "form", excluding the other variables (i.e. var1, var2, var3, ...).
Upvotes: 3
Views: 137
Reputation: 270298
We assume that the objective of getting the subsets is to find the best variables, so instead of proceeding like that let us just find the "best" variables using stepwise regression.
Since data has fewer rows than columns and var1 is all 0's, let us use the data frame data2 shown below for our example.
First create the full model fm0 and then use stepwise regression, specifying the var variables as the lower bound, i.e. every model must contain them.
This runs quickly on this data and uses no packages.
data2 <- data[c("var2", "var3", "tlv", "form20", "form19")]
fm0 <- lm(tlv ~., data2)
varnames <- grep("var", names(data2), value = TRUE)
step(fm0, list(lower = reformulate(varnames)))
giving this model:
Call:
lm(formula = tlv ~ var2 + var3 + form20, data = data2)
Coefficients:
(Intercept) var2 var3 form20
-2235.694 13.881 6.728 1.197
Upvotes: 0
Reputation: 107767
Consider the apply family to build the needed formulas for all possible combinations, then pass them into lm iteratively. Except for the broom functions, the code below uses only base R:
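library(broom)  # provides tidy() and glance()
# every combination of 1 to 9 of the form variables
indvar_list <- lapply(1:9, function(x) combn(paste0("form", 12:20), x, simplify = FALSE))
formulas_list <- rapply(indvar_list, function(x) as.formula(paste("tlv ~", paste(x, collapse="+"))))
# store each formula as text so that it fits in a data frame column
tmp1 <- do.call(rbind, lapply(formulas_list, function(f)
transform(tidy(lm(f, data=data)), frml = deparse(f, width.cutoff = 500L))
))
tmp2 <- do.call(rbind, lapply(formulas_list, function(f)
transform(glance(lm(f, data=data)), frml = deparse(f, width.cutoff = 500L))
))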
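As a rough check on how much work "all possible combinations" means here, the number of formulas can be counted before any model is fitted. This is only a sketch in base R; form_vars and n_models are illustrative names, not part of the answer's code.

```r
# Names of the nine form columns in the question's data
form_vars <- paste0("form", 12:20)

# Number of non-empty subsets: choose(9, 1) + ... + choose(9, 9) = 2^9 - 1
n_models <- sum(choose(length(form_vars), seq_along(form_vars)))
n_models
```

With only 7 rows in data, any formula with 6 or more predictors has at least 7 coefficients including the intercept, so those fits leave no residual degrees of freedom and their statistics should be read with care.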
Upvotes: 2
Reputation: 887951
We can make it shorter with map
library(purrr)
tmp1 <- map_dfr(set_names(out, out), ~ lm(.x, data = data) %>% tidy, .id = 'fmla')
tmp2 <- map_dfr(set_names(out, out), ~ lm(.x, data = data) %>% glance, .id = 'fmla')
Or, if we need only the form variables, get the names of the columns that start with "form" (using startsWith), pass them to reformulate to create the formula for lm, tidy the output, and create the "Var" column holding the column name (or, if we need the formula itself, assign the reformulate output to an object and call it later):
startsWith(names(data), "form") %>%
magrittr::extract(names(data), .) %>%
map_dfr(~ lm(reformulate(.x, 'tlv'), data = data) %>%
tidy %>%
mutate(Var = .x))
Similarly, change tidy to glance.
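For comparison, the same idea (one regression of tlv per form column) can be sketched in base R without purrr or broom. The data frame below is only a subset of the question's data, with values copied from the dput output; fit_one and reg_form are hypothetical names introduced for this illustration.

```r
# Subset of the question's data (values copied from the dput output)
data <- data.frame(
  var2   = c(23, 24, 20, 32, 31, 37, 43),
  tlv    = c(1049, 978, 1131, 1292, 1228, 1593, 1265),
  form15 = c(990.132, 866.003, 1011.025, 1089.681, 992.59, 1031.918, 959.407),
  form12 = c(1329.6, 1104.4, 1272, 1322.8, 1195.5, 1487.4, 1195.6)
)

# Keep only the columns whose names start with "form"
form_vars <- grep("^form", names(data), value = TRUE)

# Fit tlv ~ <one form column> and return a one-row summary of the fit
fit_one <- function(v) {
  fit <- lm(reformulate(v, response = "tlv"), data = data)
  data.frame(
    Var       = v,
    intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]),
    r.squared = summary(fit)$r.squared,
    AIC       = AIC(fit),
    BIC       = BIC(fit)
  )
}

# One row per form variable; var2 is never used as a predictor
reg_form <- do.call(rbind, lapply(form_vars, fit_one))
reg_form
```

The result has one row per form column, which plays roughly the role of the glance output in the purrr version, with the coefficients added alongside.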
Upvotes: 1