User981636

Reputation: 3629

Automatic variable selection – linear regression model

In the MWE below, I have a data set with 70 potential predictors to explain my variable price1. I would like to run a univariate analysis with each of the variables, but the glmulti package says that I have too many predictors. How can a univariate analysis have too many predictors?

I could do it by means of a loop or apply, but I am looking for something more elegant. This similar question doesn't solve the problem either.

test <- read.csv(url("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/Ecdat/Car.csv"))
library(glmulti)
glmulti.lm.out <- glmulti(data  = test, price1 ~ .,
                          level = 1,
                          method = "h",
                          maxK = 1,
                          confsetsize = 10,
                          fitfunction = "lm")

Output:
Warning message:
In glmulti(y = "price1", data = test, level = 1, maxK = 1, method = "h",  :
  !Too many predictors.

Upvotes: 0

Views: 989

Answers (2)

User981636

Reputation: 3629

A simple solution for the univariate analysis, using lapply:

test <- read.csv(url("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/Ecdat/Car.csv"))

# Fit a univariate regression of dep_var on a single predictor and summarise it
reg <- function(indep_var, dep_var, data_source) {
          formula <- as.formula(paste(dep_var, "~", indep_var))
          res     <- lm(formula, data = data_source)
          summary(res)
}

# Exclude the dependent variable itself from the predictor list,
# otherwise you also fit price1 ~ price1
lapply(setdiff(colnames(test), "price1"), FUN = reg,
       dep_var = "price1", data_source = test)
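
If you would rather have the results in a single table than a long list of summaries, here is a minimal sketch along the same lines. It assumes every fit produces at least one slope coefficient; for factor predictors it only reports the first contrast.

# Collect the slope estimate and p-value of each univariate fit into one data frame.
# NOTE: for factor predictors, row 2 of the coefficient table is only the first contrast.
predictors  <- setdiff(colnames(test), "price1")
uni_results <- do.call(rbind, lapply(predictors, function(v) {
  fit <- lm(as.formula(paste("price1 ~", v)), data = test)
  cf  <- summary(fit)$coefficients
  data.frame(predictor = v,
             estimate  = cf[2, "Estimate"],
             p_value   = cf[2, "Pr(>|t|)"])
}))

# Predictors ordered by univariate p-value
uni_results[order(uni_results$p_value), ]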

Upvotes: 0

kfurlong

Reputation: 323

This question is better suited to Cross Validated, but here's my two cents. Running an exhaustive search to find the best variables to include in a model is computationally heavy and gets out of hand very quickly. Consider what you're asking the computer to do:

When you run an exhaustive search, the computer builds a model for every possible combination of variables. For models of size one, that's not too bad: only 70 models. But already for two-variable models, the computer has to fit n!/(r!(n-r)!) = 70!/(2!(68)!) = 2,415 different models. Summed over all sizes, an exhaustive search over 70 predictors means 2^70 ≈ 1.18e21 candidate models, so things spiral out of control from there.
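
You can check those counts directly in R with choose():

# Number of distinct models of each size when choosing from 70 predictors
choose(70, 1:5)
# 70  2415  54740  916895  12103014

# Every possible subset of 70 predictors: 2^70 candidate models in total
2^70
# 1.180592e+21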

As a workaround, I'll point you to the leaps package, which has the regsubsets function. With it, you can run either a forward or a backward subset selection and find the most important variables in a stepwise manner. After running both (see the backward sketch after the code below), you may be able to toss out the variables that neither direction selects and run your model with fewer predictors using glmulti, but no promises.

test.data <- read.csv(url("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/Ecdat/Car.csv"))[, 2:71]
library(leaps)

big_subset_model <- regsubsets(x = price1 ~ ., data = test.data, nbest = 1,
                               method = "forward", really.big = TRUE, nvmax = 70)
sum.model <- summary(big_subset_model)
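
To act on the "run each direction and compare" idea, a sketch of the backward counterpart and the comparison; the choice of the best 10-term models to compare is arbitrary, just for illustration, and note that regsubsets reports dummy-expanded names for factor predictors:

# Backward counterpart to the forward search above
backward_model <- regsubsets(x = price1 ~ ., data = test.data, nbest = 1,
                             method = "backward", really.big = TRUE, nvmax = 70)
sum.back <- summary(backward_model)

# Variables in, e.g., the best 10-term model found by each direction
fwd_vars  <- setdiff(names(which(sum.model$which[10, ])), "(Intercept)")
back_vars <- setdiff(names(which(sum.back$which[10, ])), "(Intercept)")

# Predictors both directions agree on; the rest are candidates to drop
# before retrying glmulti with a smaller set
intersect(fwd_vars, back_vars)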

Upvotes: 2
