Totti

Reputation: 15

Simple linear regression in R with many x variables and one y: how to write one model instead of one per x variable?

I would like to analyse many x variables (400 variables) against one y variable. However, I do not want to write a new model for each and every x variable. Is it possible to write one model that then checks every x variable against y in RStudio?

Upvotes: 0

Views: 763

Answers (3)

Len Greski

Reputation: 10855

Here is an approach that uses a function to regress a dependent variable on each of the other variables in the same data frame, one variable at a time. The name of the dependent variable is passed as an argument to the function.

We use lapply() to drive lm() because it will return the resulting model objects as a list, and we are able to easily name the resulting list so we can extract models by independent variable name.

regList <- function(dataframe,depVar) {
     indepVars <- names(dataframe)[!(names(dataframe) %in% depVar)]
     
     modelList <- lapply(indepVars,function(x){
          lm(dataframe[[depVar]] ~ dataframe[[x]],data=dataframe)
     })
     # name list elements based on independent variable names 
     names(modelList) <- indepVars
     modelList
}

We demonstrate the function with the mtcars data frame, assigning the mpg column as the dependent variable.

modelList <- regList(mtcars,"mpg")

At this point the modelList object contains 10 models, one for each variable in the mtcars data frame other than mpg. We can access the individual models by independent variable name, or by index.

# print the model where cyl is independent variable 
summary(modelList[["cyl"]])

...and the output:

> summary(modelList[["cyl"]])

Call:
lm(formula = dataframe[[depVar]] ~ dataframe[[x]], data = dataframe)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.9814 -2.1185  0.2217  1.0717  7.5186 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     37.8846     2.0738   18.27  < 2e-16 ***
dataframe[[x]]  -2.8758     0.3224   -8.92 6.11e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.206 on 30 degrees of freedom
Multiple R-squared:  0.7262,    Adjusted R-squared:  0.7171 
F-statistic: 79.56 on 1 and 30 DF,  p-value: 6.113e-10
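
One side effect visible in the output above is that the coefficient row is labelled dataframe[[x]] rather than cyl. As a minimal sketch, the formula could instead be built from the column names with reformulate() from base R, so that each model labels its own predictor. The names regList2 and modelList2 are hypothetical and used only for illustration; the Call line would still show the reformulate() expression rather than mpg ~ cyl.

```r
# Hypothetical variant of regList(); regList2 is an illustrative name only.
regList2 <- function(dataframe, depVar) {
  indepVars <- setdiff(names(dataframe), depVar)
  modelList <- lapply(indepVars, function(x) {
    # reformulate("cyl", response = "mpg") builds the formula mpg ~ cyl
    lm(reformulate(x, response = depVar), data = dataframe)
  })
  # name list elements based on independent variable names
  names(modelList) <- indepVars
  modelList
}

modelList2 <- regList2(mtcars, "mpg")
# the slope is now labelled with the variable name
names(coef(modelList2[["cyl"]]))
```

The estimates are the same as before; only the labelling changes, which makes it easier to combine coefficients from many models later.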

Extracting the content

Saving the output in a list() enables us to do things like find the model with the highest R^2 without having to use vgrep.

First, we extract the r.squared value from each model summary and save the results to a vector.

r.squareds <- unlist(lapply(modelList,function(x) summary(x)$r.squared)) 

Because we used names() to name elements in the original list, R automatically saves the variable names to the element names of the vector. This comes in handy when we sort the vector by descending order of R^2 and print the first element of the resulting vector.

r.squareds[order(r.squareds,decreasing=TRUE)][1]

...and the winner (not surprisingly) is wt.

> r.squareds[order(r.squareds,decreasing=TRUE)][1]
       wt 
0.7528328 
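
With 400 predictors it may be more convenient to collect one row per model than to inspect summaries one at a time. The following is a sketch under the assumption that each model has exactly one numeric predictor, so the slope sits in the second row of the coefficient table; the data frame name results and its column names are illustrative choices, not part of the code above.

```r
# Rebuild the list of models so this block runs on its own
# (same models as regList(mtcars, "mpg") produces).
indepVars <- setdiff(names(mtcars), "mpg")
modelList <- lapply(indepVars, function(x) lm(mtcars[["mpg"]] ~ mtcars[[x]]))
names(modelList) <- indepVars

# One row per model: slope, its p-value, and R^2
results <- do.call(rbind, lapply(names(modelList), function(v) {
  s <- summary(modelList[[v]])
  data.frame(
    variable  = v,
    slope     = s$coefficients[2, "Estimate"],
    p.value   = s$coefficients[2, "Pr(>|t|)"],
    r.squared = s$r.squared
  )
}))

# Sort by descending R^2 and show the best three predictors
results <- results[order(results$r.squared, decreasing = TRUE), ]
head(results, 3)
```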

Upvotes: 2

chan1142

Reputation: 643

If your data frame is DF,

regs <- list()
for (v in setdiff(names(DF), "y")) {
  # build the formula y ~ <v> from the column name
  fm <- as.formula(sprintf("y ~ %s", v))
  regs[[v]] <- lm(fm, data=DF)
}

Now you have all simple regression results in the regs list.

Example:

## Generate data
n <- 1000
set.seed(1)
DF <- data.frame(y = rnorm(n))
for (j in seq(400)) DF[[paste0('x',j)]] <- rnorm(n)
## Now data ready

dim(DF)
# [1] 1000 401
head(names(DF))
# [1] "y"  "x1" "x2" "x3" "x4" "x5"
tail(names(DF))
# [1] "x395" "x396" "x397" "x398" "x399" "x400"

regs <- list()
for (v in setdiff(names(DF), "y")) {
  # build the formula y ~ <v> from the column name
  fm <- as.formula(sprintf("y ~ %s", v))
  regs[[v]] <- lm(fm, data=DF)
}

head(names(regs))
# [1] "x1" "x2" "x3" "x4" "x5" "x6"

r2s <- sapply(regs, function(x) summary(x)$r.squared)
head(r2s, 3)
#           x1           x2           x3 
# 0.0000409755 0.0024376111 0.0005509134 
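
If only the R^2 values are needed, fitting 400 models may not be necessary: in a simple linear regression with an intercept, R^2 equals the squared Pearson correlation between y and x. The sketch below assumes all columns are numeric and contain no missing values; the names xs and r2_fast are illustrative.

```r
# Same simulated data as in the example above
n <- 1000
set.seed(1)
DF <- data.frame(y = rnorm(n))
for (j in seq(400)) DF[[paste0("x", j)]] <- rnorm(n)

xs <- setdiff(names(DF), "y")
# cor() of y against all x columns returns a 1 x 400 matrix;
# squaring gives R^2, and drop() turns it into a named vector
r2_fast <- drop(cor(DF$y, DF[xs])^2)
head(r2_fast, 3)
```

The values should agree with r2s from the loop above up to floating-point error. The lm() objects are still needed for anything beyond R^2, such as residuals or predictions.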

Upvotes: 0

C.Robin

Reputation: 1102

If you want to include them in the models separately, you can just loop over the x variables and add them into the model on each iteration. For example:

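x_variables <- c("x_var1", "x_var2", "x_var3", "x_var4")  # ... add the remaining x variable names here
for (x in x_variables) {
  # x holds a column name as a string, so build the formula y_variable ~ <x> from it
  model <- lm(reformulate(x, response = "y_variable"), data = df)
  # print() is needed because results are not auto-printed inside a for loop
  print(summary(model))
}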

You can replace the ellipsis in the code above with all your other x variables. I hope for your sake that there is some kind of naming convention you can exploit to select the variables with a selection helper such as starts_with() or contains() inside dplyr's select()!
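
As a sketch of the naming-convention idea, the variable names could be collected with a regular expression in base R instead of being typed out. The toy data frame and the "x_var" prefix below simply follow the placeholder names used in this answer; the id column is a hypothetical extra column that should not be picked up.

```r
# Toy data using the placeholder column names from this answer
set.seed(1)
df <- data.frame(
  y_variable = rnorm(20),
  x_var1     = rnorm(20),
  x_var2     = rnorm(20),
  id         = 1:20   # hypothetical non-predictor column
)

# keep every column name that starts with "x_var"
x_variables <- grep("^x_var", names(df), value = TRUE)
x_variables
# [1] "x_var1" "x_var2"
```

With dplyr loaded, names(select(df, starts_with("x_var"))) should give the same vector.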

If you hope to include all the x variables in the same model, you just add them in as you normally would, joined with +. For example (assuming you want to use OLS, but the same approach would work for other model types):

model <- lm(y_variable ~ 
      x_var1 + x_var2 + x_var3 + x_var4,  # + ... add the remaining x variables, joined by "+"
      data = df)
summary(model)
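
When every column other than the response should enter the same model, the formula shorthand . avoids listing the terms by hand. This is a minimal sketch on toy data that reuses the placeholder names from this answer; it assumes df contains only the response and the intended predictors, and that there are more observations than predictors.

```r
# Toy data using the placeholder column names from this answer
set.seed(1)
df <- data.frame(
  y_variable = rnorm(50),
  x_var1     = rnorm(50),
  x_var2     = rnorm(50),
  x_var3     = rnorm(50)
)

# "." stands for every column of df except the response
model <- lm(y_variable ~ ., data = df)
names(coef(model))
# [1] "(Intercept)" "x_var1"      "x_var2"      "x_var3"
```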

Upvotes: 0
