Reputation: 15
I would like to analyse many x variables (400 variables) against one y variable (1 variable). However, I do not want to write a new model for each and every x variable. Is it possible to write one model that then checks all x variables against y in RStudio?
Upvotes: 0
Views: 763
Reputation: 10855
Here is an approach where we use a function that regresses all variables in a data frame on a dependent variable from the same data frame, passed as an argument to the function. We use lapply() to drive lm() because it returns the resulting model objects as a list, and we can easily name the resulting list so we can extract models by independent variable name.
regList <- function(dataframe, depVar) {
  # every column except the dependent variable becomes an independent variable
  indepVars <- names(dataframe)[!(names(dataframe) %in% depVar)]
  # fit one simple regression per independent variable
  modelList <- lapply(indepVars, function(x) {
    lm(dataframe[[depVar]] ~ dataframe[[x]], data = dataframe)
  })
  # name list elements based on independent variable names
  names(modelList) <- indepVars
  modelList
}
We demonstrate the function with the mtcars data frame, assigning the mpg column as the dependent variable.
modelList <- regList(mtcars,"mpg")
At this point the modelList object contains 10 models, one for each variable in the mtcars data frame other than mpg. We can access the individual models by independent variable name, or by index.
# print the model where cyl is independent variable
summary(modelList[["cyl"]])
...and the output:
> summary(modelList[["cyl"]])
Call:
lm(formula = dataframe[[depVar]] ~ dataframe[[x]], data = dataframe)
Residuals:
Min 1Q Median 3Q Max
-4.9814 -2.1185 0.2217 1.0717 7.5186
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.8846 2.0738 18.27 < 2e-16 ***
dataframe[[x]] -2.8758 0.3224 -8.92 6.11e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.206 on 30 degrees of freedom
Multiple R-squared: 0.7262, Adjusted R-squared: 0.7171
F-statistic: 79.56 on 1 and 30 DF, p-value: 6.113e-10
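Since the list elements keep the order of indepVars, the same model can also be reached by position; cyl happens to be the first independent variable in mtcars, so index 1 returns the model printed above:
# same model, accessed by index instead of by name
summary(modelList[[1]])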
Saving the output in a list() enables us to do things like find the model with the highest R^2 without having to scan the output by eye.
First, we extract the r.squared value from each model summary and save the results to a vector.
r.squareds <- unlist(lapply(modelList,function(x) summary(x)$r.squared))
Because we used names() to name elements in the original list, R automatically saves the variable names as the element names of the vector. This comes in handy when we sort the vector in descending order of R^2 and print the first element of the resulting vector.
r.squareds[order(r.squareds,decreasing=TRUE)][1]
...and the winner (not surprisingly) is wt.
> r.squareds[order(r.squareds,decreasing=TRUE)][1]
wt
0.7528328
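The same pattern works for other summary statistics. As a sketch (not shown in the original run), the slope p-value of each model can be pulled from the coefficient table of each summary and ranked the same way:
# p-value of the slope term: row 2, column "Pr(>|t|)" of the coefficient matrix
p.values <- unlist(lapply(modelList, function(x) summary(x)$coefficients[2, 4]))
# the three smallest p-values, still named by independent variable
p.values[order(p.values)][1:3]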
Upvotes: 2
Reputation: 643
If your data frame is DF,
regs <- list()
for (v in setdiff(names(DF), "y")) {
  fm <- eval(parse(text = sprintf("y ~ %s", v)))
  regs[[v]] <- lm(fm, data = DF)
}
Now you have all the simple regression results in the regs list.
Example:
## Generate data
n <- 1000
set.seed(1)
DF <- data.frame(y = rnorm(n))
for (j in seq(400)) DF[[paste0('x',j)]] <- rnorm(n)
## Now data ready
dim(DF)
# [1] 1000 401
head(names(DF))
# [1] "y" "x1" "x2" "x3" "x4" "x5"
tail(names(DF))
# [1] "x395" "x396" "x397" "x398" "x399" "x400"
regs <- list()
for (v in setdiff(names(DF), "y")) {
  fm <- eval(parse(text = sprintf("y ~ %s", v)))
  regs[[v]] <- lm(fm, data = DF)
}
head(names(regs))
# [1] "x1" "x2" "x3" "x4" "x5" "x6"
r2s <- sapply(regs, function(x) summary(x)$r.squared)
head(r2s, 3)
# x1 x2 x3
# 0.0000409755 0.0024376111 0.0005509134
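As a side note, the same formulas can be built without eval(parse(...)): base R's reformulate() constructs a formula directly from the variable names. A sketch of an equivalent loop:
regs <- list()
for (v in setdiff(names(DF), "y")) {
  # reformulate("x1", response = "y") returns the formula y ~ x1
  regs[[v]] <- lm(reformulate(v, response = "y"), data = DF)
}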
Upvotes: 0
Reputation: 1102
If you want to include them in separate models, you can just loop over the x variables and build the model formula from each variable name on each iteration. For example:
x_variables = list("x_var1", "x_var2", "x_var3", "x_var4", ...)
for(x in x_variables){
  # build the formula from the variable name, e.g. y_variable ~ x_var1
  model <- lm(as.formula(paste("y_variable ~", x)), data = df)
  print(summary(model))   # summary() does not auto-print inside a loop
}
You can fill in the ellipses in the code above with all your other x variables. I hope for your sake that there is some kind of naming convention you can exploit to select the variables using a dplyr verb like starts_with or contains!
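For instance, assuming the predictors share a common prefix such as x_var (a hypothetical prefix, not something given in the question), a sketch with dplyr would be:
library(dplyr)
# collect the names of all columns whose name starts with "x_var" (hypothetical prefix)
x_variables <- df %>% select(starts_with("x_var")) %>% names()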
If you hope to include all the x variables in the same model, you just add them in as you normally would. For example (assuming you want to use an OLS, but the same premise would work for other types):
model <- lm(y_variable ~ x_var1 + x_var2 + x_var3 + x_var4 + ..., data = df)
summary(model)
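With 400 predictors, typing every name out is painful; the dot shorthand in an R formula stands for every column of the data frame other than the response, so the full model can also be written as:
# "." expands to all columns of df except y_variable
model <- lm(y_variable ~ ., data = df)
summary(model)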
Upvotes: 0