Reputation: 491
I'm trying to write my own R function similar to the forward stepwise selection step
, but instead of using AIC as the selection criterion I have a few criteria which need to be evaluated each time an predictor variable is added. The construction principles of the model are explained as follows. The model should start with the predictor variable that has the highest correlation with the dependent variable. Then another predictor variable is added each time based on whether the new model meets the following criteria.
This process is repeated until no remaining variable meets all the three criteria. The output I need could just be the names of all predictors in the final model, the corresponding coefficients and the r2 value of the final model.
my example data (y is the dependent variable and x1 - x6 are predictors)
data = structure(list(y = c(23.6, 19.9, 40.7, 40.7, 40.7, 40.7, 40.2,
41.7, 41.7, 28.8), x1 = c(0.1, 0, 0.3, 0.3, 0.3,
0.3, 0.3, 0.3, 0.3, 0.1), x2 = c(0, 0.1, 0, 0, 0,
0, 0, 0.1, 0.1, 0), x3 = c(2277.6, 3038.1, 7797.9, 7797.9,
7797.9, 7797.9, 8392.2, 10127.2, 10127.2, 1799), x4 = c(34228.7,
49815, 76917.1, 76917.1, 76917.1, 76917.1, 75981.4, 74881.1,
74881.1, 56798.2), x5 = c(108786.5, 150465.5, 230397.1, 230397.1,
230397.1, 230397.1, 239300.9, 238493.8, 238493.8, 188799.5),
x6 = c(362.2, 198.2, 656.6, 656.6, 656.6, 656.6, 681,
655.3, 655.3, 222.3)), .Names = c("y", "x1",
"x2", "x3", "x4", "x5", "x6"), row.names = c(NA,
10L), class = "data.frame")
First attempt on my model selection function
modSel = function(data, var){
cor.result = cor(data[,var], df["y"]) #calculate correlation coeff for each variable against y
max.cor = rownames(cor.result)[which.max(cor.result)] #identify the variable with max cor
start.model = lm(as.formula(paste("y", max.cor, sep = "~")), data)
if #my criteria??
else #??]
Without much background in programming, I really don't have any ideas on how to repeat the evaluation of my criteria for an unknown amount of times. I realized that to achieve this might require quite a bit of coding but for starters I would appreciate any guidance on what the whole framework should look like.
Cheers
Upvotes: 0
Views: 507
Reputation: 6213
Hopefully this can help you get started
Function to run the algorithm
modSel <- function(data) {
# initial
cor.result <- cor(data$y, data[, -which(colnames(data) == "y")])
vars.model <- colnames(cor.result)[which.max(cor.result)]
vars.remaining <- colnames(data)[!colnames(data) %in% c("y", max.cor)]
start.model <- lm(as.formula(paste("y", vars.model, sep = "~")), data)
adj.rsq <- summary(start.model)$adj.r.squared
# algorithm
for (var in vars.remaining) {
# model
vars.test <- paste(vars.model, var, sep="+")
fit <- lm(as.formula(paste("y", vars.test, sep="~")), data)
new.rsq <- summary(fit)$adj.r.squared
# check adj rsq
cond1 <- new.rsq > adj.rsq + .01
# check coefficients
cond2 <- coefficients(fit)[var] > 0
# new var significant
cf <- summary(fit)$coefficients[, 4]
cond3 <- cf[var] < .05
if (cond1 & cond2 & cond3) {
vars.model <- vars.test
adj.rsq <- new.rsq
}
}
lm(as.formula(paste("y", vars.model, sep="~")), data)
}
Call to modSel
returns the best model from the algorithm
bestfit <- modSel(data)
Summarize the model
summary(bestfit)
Call:
lm(formula = as.formula(paste("y", vars.model, sep = "~")), data = data)
Residuals:
Min 1Q Median 3Q Max
-0.8731 -0.3838 -0.3838 0.6273 1.5640
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.331e+01 1.941e+00 6.856 0.000241 ***
x1 5.417e+01 5.834e+00 9.285 3.48e-05 ***
x4 1.498e-04 4.460e-05 3.359 0.012099 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9345 on 7 degrees of freedom
Multiple R-squared: 0.9904, Adjusted R-squared: 0.9876
F-statistic: 360.5 on 2 and 7 DF, p-value: 8.721e-08
Upvotes: 0