Reputation: 1143
I am using R to fit a logistic regression model to my data set, which has more than 50 variables.
The challenge is to write R code that assesses the statistical validity of individual records and variables (e.g., p-values > .05) and eliminates records and variables from the model based on criteria like that.
Is there an already implemented method to do this? Any help or suggestion will be appreciated. Thank you.
Upvotes: 0
Views: 2114
Reputation: 691
Here is a basic function that takes a set of predictor variables and eliminates them step by step until it arrives at a linear model in which every remaining predictor is below the desired significance level.
reverse.step <- function(y, b, df, alpha = 0.05) {
  # y     = dependent variable name (as character), e.g. 'Height'
  # b     = vector of explanatory variable names (as characters),
  #         e.g. c('x1', 'x2', 'x3', ...)
  # df    = data frame containing y and the predictors
  # alpha = significance level a predictor must meet to stay in the model
  form <- as.formula(paste(y, '~', paste(b, collapse = ' + ')))
  smry <- summary(lm(form, data = df))
  cat(b, "\n")
  # p-values of the predictors (row 1 is the intercept, so skip it)
  pvals <- smry$coefficients[2:nrow(smry$coefficients), 4]
  # if even the least significant predictor passes, we are done
  if (pvals[which.max(pvals)] < alpha) {
    return(smry)
  }
  # otherwise drop the predictor with the largest p-value and recurse
  new.b <- names(pvals)[-which.max(pvals)]
  if (length(new.b) == 0 || length(new.b) == length(b)) {
    return(smry)
  } else {
    return(reverse.step(y, new.b, df, alpha))
  }
}
It may not be the most robust function, but it will get you started.
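The single elimination step the function performs can be tried in isolation. Here is a small sketch on simulated data (the variable names `x1`, `x2`, `x3` are made up for illustration):

```r
set.seed(1)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
df$y <- 2 * df$x1 + rnorm(100)   # only x1 truly matters here

# Fit the full model and pull the predictor p-values (drop the intercept row)
smry  <- summary(lm(y ~ x1 + x2 + x3, data = df))
pvals <- smry$coefficients[-1, 4]

# The predictor the function would eliminate first
cat("drop:", names(which.max(pvals)), "\n")
```

Each pass refits the model and removes the least significant predictor, which is exactly what the recursion above automates.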
You could also check out the regsubsets() function in the leaps package.
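Since the question asks for an already implemented method: base R's step() does stepwise selection by AIC rather than by p-values, but it needs no extra packages and works on both lm and glm fits. A minimal sketch on the built-in mtcars data:

```r
# Full linear model on the built-in mtcars data
full <- lm(mpg ~ ., data = mtcars)

# Backward elimination driven by AIC (trace = 0 silences the per-step log)
reduced <- step(full, direction = "backward", trace = 0)

# The predictors that survived
names(coef(reduced))
```

For logistic regression you would start from a glm(..., family = binomial) fit instead; note that AIC-based and p-value-based elimination can keep different variables.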
Upvotes: 2