Reputation: 1480
I have been doing variable selection for a modeling problem.
I have used trial and error for the selection (adding / removing a variable) with a decrease in error. However, I have the challenge as the number of variables grows into the hundreds that manual variable selection can not be performed as the model takes 1/2 hour to compute, rendering the task impossible.
Would you happen to know of any other packages than the regsubsets from the leaps package (which when tested with the same trial and error variables produced a higher error, it did not include some variables which were lineraly dependant - excluding some valuable variables).
Upvotes: 2
Views: 6385
Reputation: 7714
Try using the stepAIC
function of the MASS package.
Here is a really minimal example:
library(MASS)
data(swiss)
str(swiss)
lm <- lm(Fertility ~ ., data = swiss)
lm$coefficients
## (Intercept) Agriculture Examination Education Catholic
## 66.9151817 -0.1721140 -0.2580082 -0.8709401 0.1041153
## Infant.Mortality
## 1.0770481
st1 <- stepAIC(lm, direction = "both")
st2 <- stepAIC(lm, direction = "forward")
st3 <- stepAIC(lm, direction = "backward")
summary(st1)
summary(st2)
summary(st3)
You should try the 3 directions and ckeck which model works better with your test data. Read ?stepAIC and take a look at the examples.
EDIT
It's true stepwise regression isn't the greatest method. As it's mentioned in GavinSimpson answer, lasso regression is a better/much more efficient method. It's much faster than stepwise regression and will work with large datasets. Check out the glmnet package vignette: http://www.stanford.edu/~hastie/glmnet/glmnet_alpha.html
Upvotes: 2
Reputation: 174813
You need a better (i.e. not flawed) approach to model selection. There are plenty of options, but one that should be easy to adapt to your situation would be using some form of regularization, such as the Lasso or the elastic net. These apply shrinkage to the sizes of the coefficients; if a coefficient is shrunk from its least squares solution to zero, that variable is removed from the model. The resulting model coefficients are slightly biased but they have lower variance than the selected OLS terms.
Take a look at the lars, glmnet, and penalized packages
Upvotes: 6