Mohammad Saifullah

Reputation: 1143

Logistic regression: drop insignificant predictor variables

I am using R to perform logistic regression on my data set. My data set has more than 50 variables.

The challenge is to write R code that can assess the statistical significance of each variable (e.g., flag those with p-values > 0.05) and eliminate insignificant variables from the model based on criteria like that.

Is there any already implemented method to do this? Any help or suggestions would be appreciated. Thank you.

Upvotes: 0

Views: 2114

Answers (1)

petew

Reputation: 691

Here is a basic function that takes a set of predictor variables and eliminates them one at a time until it finds a linear model in which every remaining predictor is below the desired significance level. (It fits with lm, but the same backward-elimination logic applies to glm for logistic regression.)

reverse.step <- function(y, b, df, alpha=0.05) {
  # y     = dependent variable name (as character), e.g. 'Height'
  # b     = vector of explanatory variable names (as characters),
  #         e.g. c('x1','x2','x3',...)
  # df    = data frame
  # alpha = significance threshold a predictor must meet to stay in
  fit.sum <- summary(lm(as.formula(paste(y, '~', paste(b, collapse='+'))),
                        data=df))
  cat(b)
  cat("\n")
  # p-values of every coefficient except the intercept
  pvals <- fit.sum$coefficients[2:nrow(fit.sum$coefficients), 4]
  # stop when even the least significant predictor is below alpha
  if (pvals[which.max(pvals)] < alpha) {
    return(fit.sum)
  }
  # otherwise drop the predictor with the largest p-value and recurse
  new.b <- names(pvals[-which.max(pvals)])
  if (length(new.b) == 0 || length(new.b) == length(b)) {
    return(fit.sum)
  } else {
    return(reverse.step(y, new.b, df, alpha))
  }
}

It may not be the most robust function, but it will get you started.
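For example, the function can be tried on the built-in mtcars data; a self-contained sketch (the starting predictor set here is an arbitrary choice):

```r
# reverse.step as defined above, condensed
reverse.step <- function(y, b, df, alpha = 0.05) {
  fit.sum <- summary(lm(as.formula(paste(y, '~', paste(b, collapse = '+'))),
                        data = df))
  pvals <- fit.sum$coefficients[2:nrow(fit.sum$coefficients), 4]
  if (pvals[which.max(pvals)] < alpha) return(fit.sum)
  new.b <- names(pvals[-which.max(pvals)])
  if (length(new.b) == 0 || length(new.b) == length(b)) return(fit.sum)
  reverse.step(y, new.b, df, alpha)
}

# Start from five candidate predictors of mpg; the least significant one
# is dropped at each step until all remaining p-values are below 0.05
res <- reverse.step('mpg', c('wt', 'hp', 'drat', 'qsec', 'gear'), mtcars)
res$coefficients[, 4]  # p-values of the intercept and surviving predictors
```

Note that the function matches predictors by their coefficient names, so it works as written only for numeric predictors; factor variables would expand into per-level coefficient names that no longer match the column names in `b`.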

You could also check out the regsubsets function in the leaps package.
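A minimal sketch of that route, assuming the leaps package is installed (regsubsets ranks candidate models by criteria such as BIC or adjusted R-squared rather than by individual p-values):

```r
library(leaps)

# Best-subset search over all predictors of mpg, up to 5 at a time
fit <- regsubsets(mpg ~ ., data = mtcars, nvmax = 5)
s <- summary(fit)

# summary() reports one best model per subset size;
# pick the size with the lowest BIC and inspect its coefficients
best.size <- which.min(s$bic)
coef(fit, best.size)
```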

Upvotes: 2
