moho wu
moho wu

Reputation: 491

Multivariate linear model stepwise selection based on predefined criteria

I'm trying to write my own R function similar to the forward stepwise selection step, but instead of using AIC as the selection criterion I have a few criteria which need to be evaluated each time an predictor variable is added. The construction principles of the model are explained as follows. The model should start with the predictor variable that has the highest correlation with the dependent variable. Then another predictor variable is added each time based on whether the new model meets the following criteria.

  1. The adjusted r2 value must increase by over 1%;
  2. The coefficients for the newly added variable and existing variables must be positive;
  3. The added variable must be significant i.e. p-value < 0.05.

This process is repeated until no remaining variable meets all the three criteria. The output I need could just be the names of all predictors in the final model, the corresponding coefficients and the r2 value of the final model.

my example data (y is the dependent variable and x1 - x6 are predictors)

data = structure(list(y = c(23.6, 19.9, 40.7, 40.7, 40.7, 40.7, 40.2, 
41.7, 41.7, 28.8), x1 = c(0.1, 0, 0.3, 0.3, 0.3, 
0.3, 0.3, 0.3, 0.3, 0.1), x2 = c(0, 0.1, 0, 0, 0, 
0, 0, 0.1, 0.1, 0), x3 = c(2277.6, 3038.1, 7797.9, 7797.9, 
7797.9, 7797.9, 8392.2, 10127.2, 10127.2, 1799), x4 = c(34228.7, 
49815, 76917.1, 76917.1, 76917.1, 76917.1, 75981.4, 74881.1, 
74881.1, 56798.2), x5 = c(108786.5, 150465.5, 230397.1, 230397.1, 
230397.1, 230397.1, 239300.9, 238493.8, 238493.8, 188799.5), 
x6 = c(362.2, 198.2, 656.6, 656.6, 656.6, 656.6, 681, 
655.3, 655.3, 222.3)), .Names = c("y", "x1", 
"x2", "x3", "x4", "x5", "x6"), row.names = c(NA, 
10L), class = "data.frame")

First attempt on my model selection function

modSel = function(data, var){
cor.result = cor(data[,var], df["y"]) #calculate correlation coeff for each variable against y
max.cor = rownames(cor.result)[which.max(cor.result)] #identify the variable with max cor
start.model = lm(as.formula(paste("y", max.cor, sep = "~")), data)
if #my criteria??
else #??]

Without much background in programming, I really don't have any ideas on how to repeat the evaluation of my criteria for an unknown amount of times. I realized that to achieve this might require quite a bit of coding but for starters I would appreciate any guidance on what the whole framework should look like.

Cheers

Upvotes: 0

Views: 507

Answers (1)

Whitebeard
Whitebeard

Reputation: 6213

Hopefully this can help you get started

Function to run the algorithm

modSel <- function(data) {
  # initial
  cor.result <- cor(data$y, data[, -which(colnames(data) == "y")])
  vars.model <- colnames(cor.result)[which.max(cor.result)]
  vars.remaining <- colnames(data)[!colnames(data) %in% c("y", max.cor)]
  start.model <- lm(as.formula(paste("y", vars.model, sep = "~")), data)
  adj.rsq <- summary(start.model)$adj.r.squared

  # algorithm
  for (var in vars.remaining) {
    # model
    vars.test <- paste(vars.model, var, sep="+")
    fit <- lm(as.formula(paste("y", vars.test, sep="~")), data)
    new.rsq <- summary(fit)$adj.r.squared

    # check adj rsq
    cond1 <- new.rsq > adj.rsq + .01

    # check coefficients 
    cond2 <- coefficients(fit)[var] > 0

    # new var significant
    cf <- summary(fit)$coefficients[, 4]
    cond3 <- cf[var] < .05

    if (cond1 & cond2 & cond3) {
       vars.model <- vars.test
       adj.rsq <- new.rsq
    }
  }
  lm(as.formula(paste("y", vars.model, sep="~")), data)
}

Call to modSel returns the best model from the algorithm

bestfit <- modSel(data)

Summarize the model

summary(bestfit)
Call:
lm(formula = as.formula(paste("y", vars.model, sep = "~")), data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.8731 -0.3838 -0.3838  0.6273  1.5640 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.331e+01  1.941e+00   6.856 0.000241 ***
x1          5.417e+01  5.834e+00   9.285 3.48e-05 ***
x4          1.498e-04  4.460e-05   3.359 0.012099 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9345 on 7 degrees of freedom
Multiple R-squared:  0.9904,    Adjusted R-squared:  0.9876 
F-statistic: 360.5 on 2 and 7 DF,  p-value: 8.721e-08

Upvotes: 0

Related Questions