Reputation: 1
I am running a logistic regression on a binary DV with two predictors (gender and political leaning; binary and continuous, respectively). I need help getting my GLMs to run in a cross-validation! I can't get my code to work despite reclassifying the variables multiple times. I'm not sure what's going on.
Here is the code I have:
#######################################################
# Cross-Validation of the Logistic Regression
#######################################################
gen <- as.numeric(choicelife.data$gender)
lnc <- as.numeric(choicelife.data$lc)
procprol <-as.numeric(choicelife.data$views)
# This code could be useful
nCV <- 50
MSE_1 <- numeric(nCV)
MSE_2 <- numeric(nCV)
folds <- cut(sample(n),breaks=nCV,labels=FALSE)
#Perform n.folds fold cross validation
i <- 1
for(i in 1:nCV){
#Segment your data by fold using the which() function
testIndexes <- which(folds==i,arr.ind=TRUE)
testData <- choicelife.data[testIndexes, ]
trainData <- choicelife.data[-testIndexes, ]
# Models
mod1<- glm(views ~ gen,
family=binomial(link=logit), data=trainData)
mod2<- glm(views ~ gen + lnc,
family=binomial(link=logit), data=trainData)
# Get predictions
pred_1 <- predict(mod1, newdata = testData)
pred_2 <- predict(mod2, newdata = testData)
# Calculate MSE
MSE_1[i] <- mean((testData$views - pred_1)^2)
MSE_2[i] <- mean((testData$views - pred_2)^2)
}
warnings()
# mean MSEs
mean(MSE_1)
mean(MSE_2)
# get differences
diffs <- MSE_1 - MSE_2
# get 95% CIs
meandiff <- mean(diffs)
sddiff <- sd(diffs)
c(meandiff-2*sddiff, meandiff+2*sddiff) # 95% Confidence interval (n, n)
Upvotes: 0
Views: 2455
Reputation: 47008
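You converted some of the variables to numeric but did not place them inside the data.frame. Inside your loop over nCV, the subsetted data frames (trainData and testData) do not contain the numeric variables, so glm() picks up the full-length gen and lnc from your workspace instead. Their lengths no longer match the rows of trainData, and the model will not work. Also, n in sample(n) is not defined in the code shown; use the number of rows of the data frame.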
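To see why this fails, here is a minimal sketch on made-up data. The names d, g_num, train, res, failed and fit are hypothetical and do not come from the question; the point is only that a free-standing vector keeps its full length when the data frame is subsetted.

```r
# Toy data; every name here is illustrative only
d <- data.frame(
  y = rep(c(0, 1, 1, 0), 5),
  g = rep(c("M", "F"), 10),
  stringsAsFactors = TRUE
)
g_num <- as.numeric(d$g)   # free-standing vector of length 20, not a column of d
train <- d[1:15, ]         # 15-row training subset

# y is taken from train (15 values); g_num is taken from the workspace (20 values)
res <- try(glm(y ~ g_num, family = binomial(link = logit), data = train),
           silent = TRUE)
failed <- inherits(res, "try-error")   # TRUE: glm() stops because the lengths differ

# Storing the converted variable as a column keeps it aligned with the rows
d$g_num <- as.numeric(d$g)
train <- d[1:15, ]
fit <- glm(y ~ g_num, family = binomial(link = logit), data = train)
```

Once the converted variable lives in the data frame, every subset carries it along, which is what the suggested edit further down relies on.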
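First, I simulate something that should look like your data frame choicelife.data:
choicelife.data = data.frame(
lc=sample(1:10,100,replace=TRUE),
gender=sample(c("M","F"),100,replace=TRUE),
views = sample(c("Pro","Against"),100,replace=TRUE),
# R >= 4.0 no longer converts strings to factors by default;
# without this, as.numeric() on gender and views returns NA
stringsAsFactors = TRUE
)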
See below for suggested edit:
choicelife.data$gen <- as.numeric(choicelife.data$gender)
choicelife.data$lnc <- as.numeric(choicelife.data$lc)
# make this 0 or 1
choicelife.data$procprol <-as.numeric(choicelife.data$views)-1
# This code could be useful
nCV <- 5
MSE_1 <- numeric(nCV)
MSE_2 <- numeric(nCV)
folds <- cut(sample(1:nrow(choicelife.data)),breaks=nCV,labels=FALSE)
for(i in 1:nCV){
testIndexes <- which(folds==i,arr.ind=TRUE)
testData <- choicelife.data[testIndexes, ]
trainData <- choicelife.data[-testIndexes, ]
# Models
mod1<- glm(procprol ~ gen,
family=binomial(link=logit), data=trainData)
mod2<- glm(procprol ~ gen + lnc,
family=binomial(link=logit), data=trainData)
# Get predictions
pred_1 <- predict(mod1, newdata = testData,type="response")
pred_2 <- predict(mod2, newdata = testData,type="response")
# Calculate MSE
MSE_1[i] <- mean((testData$procprol - pred_1)^2)
MSE_2[i] <- mean((testData$procprol - pred_2)^2)
}
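Note the type="response" argument in the predict() calls above. By default, predict() on a binomial glm returns the linear predictor (log-odds), which is not on the same scale as a 0/1 outcome, so an MSE computed from it is hard to interpret. A small sketch on simulated data (d, fit, eta, p and same are hypothetical names, and the simulated coefficients are arbitrary):

```r
set.seed(1)
# Simulated data, for illustration only
d <- data.frame(x = rnorm(50))
d$y <- rbinom(50, 1, plogis(0.5 * d$x))

fit <- glm(y ~ x, family = binomial(link = logit), data = d)

eta <- predict(fit, newdata = d)                     # link scale: log-odds, any real number
p   <- predict(fit, newdata = d, type = "response")  # response scale: probabilities

# The inverse logit maps the log-odds onto the probabilities
same <- isTRUE(all.equal(unname(plogis(eta)), unname(p)))
range(p)   # stays strictly between 0 and 1
```

With probabilities on the response scale, the mean squared difference from the 0/1 outcome is the Brier score, so MSE_1 and MSE_2 can be compared as such.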
Upvotes: 1