I am trying to write a function which takes a binary response variable y and a single explanatory variable x, runs 10-fold cross-validation, and returns the proportion of the responses y that are incorrectly classified. I then want to run this function to find out which of the explanatory variables is best at predicting low birth weight.
So far I have tried
library(MASS)  # for the birthwt data
data(birthwt)
attach(birthwt)
folds <- cut(seq(1, nrow(birthwt)), breaks = 10, labels = FALSE)  # Create 10 equally sized folds
bwt <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv, family = binomial)
But I really don't know where to go from here...
Thank you
For each fold, you need to construct a training and a test data set, fit the model on the training set, and then use the predict.glm
function to do the actual predictions on the test set. The comments below should be enough explanation.
library("MASS") # for the dataset
library("dplyr") # for the filter function
data(birthwt)
set.seed(1)
n.folds <- 10
folds <- cut(sample(seq_len(nrow(birthwt))), breaks = n.folds, labels = FALSE)  # Note the sample()!
all.confusion.tables <- list()
for (i in seq_len(n.folds)) {
  # Create training set: all samples except those in fold i
  train <- filter(birthwt, folds != i)
  # Create test set: the samples in fold i
  test <- filter(birthwt, folds == i)
  # Fit the glm model on the training set
  glm.train <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv,
                   family = binomial, data = train)
  # Predict on the test set; predict.glm returns the log-odds
  # (link scale) by default, not probabilities
  logit.pred <- predict(glm.train, newdata = test)
  # Classify as 1 when the log-odds are positive, i.e. probability > 0.5
  pred.class <- ifelse(logit.pred < 0, 0, 1)
  # Construct the confusion table for this fold
  all.confusion.tables[[i]] <- table(pred = pred.class, true = test$low)
}
Note that I assign the rows to folds at random via sample(). This is important.
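To see why, note that the rows of birthwt are not in random order, so folds built from consecutive row numbers (as in the question) can end up badly class-imbalanced. You can check this yourself with a quick tabulation (purely illustrative; seq.folds is a throwaway name):
seq.folds <- cut(seq_len(nrow(birthwt)), breaks = n.folds, labels = FALSE)
table(fold = seq.folds, low = birthwt$low)  # sequential folds: very uneven classes
table(fold = folds, low = birthwt$low)      # shuffled folds: roughly balanced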
We can then compare the predicted and the true values (given in the confusion tables) for, say, the 5th cross-validation run:
all.confusion.tables[[5]]
# true
# pred 0 1
# 0 7 8
# 1 4 0
Hence, in this particular run, taking low = 1 as the positive class, we have 7 true negatives, 4 false positives, 8 false negatives, and 0 true positives.
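From a single confusion table we can also read off other measures directly. For instance, sensitivity and specificity for this fold (illustrative only; ct is just a throwaway name):
ct <- all.confusion.tables[[5]]
sensitivity <- ct["1", "1"] / sum(ct[, "1"])  # TP/(TP + FN) = 0/8
specificity <- ct["0", "0"] / sum(ct[, "0"])  # TN/(TN + FP) = 7/11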
Across all ten folds, we can likewise compute whatever measure of performance we want from the confusion tables.
# Compute the average misclassification risk.
misclassrisk <- function(x) { (sum(x) - sum(diag(x))) / sum(x) }
risk <- sapply(all.confusion.tables, misclassrisk)
mean(risk)
#[1] 0.3119883
So, in this case, we have a misclassification risk of about 31%. I'll leave it to you to wrap this into a function.
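As a starting point, a minimal sketch could look like the following. The function name cv.misclass and its arguments are my own choices, and I use base subsetting and a direct error proportion instead of dplyr::filter and the confusion tables:
cv.misclass <- function(x.name, data = birthwt, n.folds = 10) {
  # Randomly assign each row to one of n.folds folds
  folds <- cut(sample(seq_len(nrow(data))), breaks = n.folds, labels = FALSE)
  fold.risks <- numeric(n.folds)
  for (i in seq_len(n.folds)) {
    train <- data[folds != i, ]
    test  <- data[folds == i, ]
    # Regress low on the single explanatory variable x.name
    fit <- glm(reformulate(x.name, response = "low"),
               family = binomial, data = train)
    # Positive log-odds correspond to a predicted probability above 0.5
    pred.class <- ifelse(predict(fit, newdata = test) < 0, 0, 1)
    # Proportion misclassified in this fold
    fold.risks[i] <- mean(pred.class != test$low)
  }
  mean(fold.risks)
}

# Compare the single predictors, e.g.:
sapply(c("age", "lwt", "smoke", "ht", "ui"), cv.misclass)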