Reputation: 1083
I want to run best subset regression on a set of variables and then get the best 2 variables using R. I'm having problems obtaining the best 2 variables. I've included my code below.
set.seed(10)
# simulate eight independent random variables
a <- rnorm(100)
b <- rnorm(100)
c <- rnorm(100)
d <- rnorm(100)
e <- rnorm(100)
f <- rnorm(100)
g <- rnorm(100)
h <- rnorm(100)
data <- data.frame(a, b, c, d, e, f, g, h)
library(leaps)
# best subsets regression
test <- regsubsets(a ~ b + c + d + e + f + g + h, data=data, nbest=4)
# nbest = 4 is the number of subsets of each size that are reported
# print a table of models showing the variables in each model
summary(test)
# models are ordered by the selection statistic.
plot(test,scale="r2")
# get the variables that are important to the model
coef(test, 2)
# Note: this doesn't give me the 2 best variables. It only gives me the best variable at the 2nd iteration. Look at:
coef(test, 1:2)
Your help would be greatly appreciated!
Best, Dana
Upvotes: 1
Views: 2282
Reputation: 13118
Consider an example with the built-in mtcars dataset:
test <- regsubsets(mpg ~ ., data = mtcars, nbest = 4)
This is the output from summary(test):
summary(test)
# Subset selection object
# Call: regsubsets.formula(mpg ~ ., data = mtcars, nbest = 4)
# 10 Variables (and intercept)
# <..snip..>
# 4 subsets of each size up to 8
# Selection Algorithm: exhaustive
#          cyl disp hp  drat wt  qsec vs  am  gear carb
# 1  ( 1 ) " " " "  " " " "  "*" " "  " " " " " "  " "
# 1  ( 2 ) "*" " "  " " " "  " " " "  " " " " " "  " "
# 1  ( 3 ) " " "*"  " " " "  " " " "  " " " " " "  " "
# 1  ( 4 ) " " " "  "*" " "  " " " "  " " " " " "  " "
# 2  ( 1 ) "*" " "  " " " "  "*" " "  " " " " " "  " "
# 2  ( 2 ) " " " "  "*" " "  "*" " "  " " " " " "  " "
# 2  ( 3 ) " " " "  " " " "  "*" "*"  " " " " " "  " "
# 2  ( 4 ) " " " "  " " " "  "*" " "  "*" " " " "  " "
# 3  ( 1 ) " " " "  " " " "  "*" "*"  " " "*" " "  " "
# <..snip..>
The sets of coefficients are arranged by the number of independent variables, in groups of 4 (the value we passed to the nbest argument). Hence, coef(test, 1:4) will return coefficients from models with one independent variable, coef(test, 5:8) will be those with two independent variables, and so on. Within each group, the "best" model comes first. The "best" model with two independent variables will therefore be model 5:
coef(test, 5)
# (Intercept) cyl wt
# 39.686261 -1.507795 -3.190972
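If you would rather not count rows by hand, the index can be looked up from the summary object. The following is only a sketch under some assumptions: it takes "best" to mean the smallest residual sum of squares among models of the same size, and the helper name best_of_size is made up for this illustration (it is not part of leaps).

```r
library(leaps)

test <- regsubsets(mpg ~ ., data = mtcars, nbest = 4)
s <- summary(test)

# Number of predictors in each reported model (the intercept column is excluded)
size <- rowSums(s$which[, colnames(s$which) != "(Intercept)", drop = FALSE])

# Hypothetical helper: index of the lowest-RSS model with k predictors
best_of_size <- function(k) {
  candidates <- which(size == k)
  unname(candidates[which.min(s$rss[candidates])])
}

idx <- best_of_size(2)
idx                         # position of the best two-variable model
coef(test, idx)             # its coefficients
names(coef(test, idx))[-1]  # just the names of the selected variables
```

For the mtcars example this should pick out the same model as coef(test, 5) above, and it keeps working if some size has fewer than nbest models.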
Upvotes: 2