Reputation: 2096
A two-part question. I'm trying to figure out:
(1) how to generate a ROC curve for a linear regression fit with lm() (and whether that's even a valid thing to do), and
(2) how to do it with k-fold cross-validation so I can get the mean ROC curve (and AUC).
If the outcome is continuous, it has to be converted into a binary variable, right? Normally I would instead fit a logistic regression model using glm(..., family = 'binomial'), but is that the most appropriate approach? (It seems like I'd just be fitting a different model.)
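For concreteness, by "converted into a binary variable" I mean something like this (df, y, and x are placeholders; the cut point at the mean is arbitrary):
df$y01 <- as.numeric(df$y > mean(df$y, na.rm = TRUE))  # dichotomize at the mean
fit <- glm(y01 ~ x, data = df, family = "binomial")    # then fit a logistic model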
I would like something like the plot below from the cvAUC package's rdrr.io website (the red line is the mean ROC curve, the dotted lines are the k-fold ROC curves), but I'm not sure how to get there with my data.
Example with data(USArrests):
library(dplyr)
library(pROC)
data(USArrests)
# create train and test sets
set.seed(2021)
dat <- mutate(USArrests, index=1:nrow(USArrests))
train.dat <- sample_frac(dat, 0.5) # splits `dat` in half
test.dat <- subset(dat, !dat$index %in% train.dat$index) # uses other half to test
# trying to build predictions with lm()
fit <- lm(Murder ~ Assault, data = train.dat)
predicted <- predict(fit, test.dat, type = "response")
# roc curve
roc(test.dat$Murder ~ predicted, plot = TRUE, print.auc = TRUE) # AUC = 1.000
The code above runs, but it gives a warning:
Warning message: In roc.default(response, m[[predictors]], ...) : 'response' has more than two levels. Consider setting 'levels' explicitly or using 'multiclass.roc' instead
I don't know what to do with that suggestion. It also gives AUC = 1.000 -- is this approach wrong, and if so, why?
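My best guess is that the warning (and the AUC of exactly 1) comes from Murder being continuous, so pROC treats each distinct value as a "level" rather than a class. If that's right, dichotomizing the response before calling roc() should give a legitimate ROC curve; a rough sketch (cutting at the training-set mean is arbitrary):
test.binary <- as.numeric(test.dat$Murder > mean(train.dat$Murder))  # 0/1 response
roc(test.binary, predicted, plot = TRUE, print.auc = TRUE)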
Moreover, this only works with a single train/test split; I'm not sure how to train with k-fold sets. I think I have to combine it with caret::train() somehow. I tried the ROC solutions for random forest models from "ROC curve from training data in caret", but they don't work with my code.
Example:
library(caret)
library(MLeval)
train_control <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
rfFit <- train(Murder ~ Assault, data = USArrests, trControl = train_control, method = "lm")
rfFit$pred$mtry # NULL
res <- MLeval::evalm(rfFit) # error with error message below
MLeval: Machine Learning Model Evaluation
Input: caret train function object
Not averaging probs.
Group 1 type: cv
Error in `[.data.frame`(preds, c(G1, G2, "obs")) : undefined columns selected
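If I read the error right, evalm() is looking for class-probability columns named after the outcome classes, which a regression fit has no way to produce. So presumably I'd need a factor outcome and a classification method; continuing the code above, something like this (untested on my side):
# recode the outcome as a two-level factor with valid R level names
USArrests$MurderClass <- factor(ifelse(USArrests$Murder > mean(USArrests$Murder),
                                       "High", "Low"))
# classProbs = TRUE so the saved fold predictions include class probabilities
train_control <- trainControl(method = "cv", number = 10,
                              savePredictions = TRUE, classProbs = TRUE)
glmFit <- train(MurderClass ~ Assault, data = USArrests,
                trControl = train_control, method = "glm")
res <- MLeval::evalm(glmFit)  # should now find the probability columns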
Upvotes: 0
Views: 3646
Reputation: 22034
You could do the cross-validation like this if you switch the outcome to a 0/1 variable:
library(rsample)
library(purrr)
library(tidyverse)
library(pROC)
USArrests <- USArrests %>%
  mutate(Murder01 = as.numeric(Murder > mean(Murder, na.rm = TRUE)))
set.seed(2021)
# fit on the analysis (training) part of each split, predict on the
# assessment (hold-out) part, and return predictions with the observed 0/1 outcome
cvfun <- function(split, ...){
  mod <- glm(Murder01 ~ Assault, data = analysis(split), family = binomial)
  fit <- predict(mod, newdata = assessment(split), type = "response")
  data.frame(fit = fit,
             y = model.response(model.frame(formula(mod), data = assessment(split))))
}
cv_out <- vfold_cv(USArrests, v = 10, repeats = 5) %>%
  mutate(fit = map(splits, cvfun)) %>%
  unnest(fit) %>%
  group_by(id) %>%  # `id` identifies the repeat
  summarise(auc = roc(y, fit, plot = FALSE)$auc[1])
cv_out
cv_out
# # A tibble: 5 x 2
# id auc
# * <chr> <dbl>
# 1 Repeat1 0.936
# 2 Repeat2 0.928
# 3 Repeat3 0.937
# 4 Repeat4 0.918
# 5 Repeat5 0.942
That said, I'm not sure this is better than using something like the R-squared or MSE of the linear model. And I'm not super confident that the algorithm in the tutorial is actually doing something that makes sense statistically. I could definitely be wrong and would defer to someone with more expertise, but I can't see how it makes much sense, and it certainly doesn't produce something meaningful in this case: you would expect an AUC of 1 only under perfect prediction.
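For what it's worth, the same rsample scaffolding gives a cross-validated MSE for the untransformed linear model; a sketch (cvfun_mse is just an illustrative name):
# same resampling pattern, but scoring the continuous model by hold-out MSE
cvfun_mse <- function(split, ...){
  mod <- lm(Murder ~ Assault, data = analysis(split))
  pred <- predict(mod, newdata = assessment(split))
  data.frame(mse = mean((assessment(split)$Murder - pred)^2))
}
vfold_cv(USArrests, v = 10, repeats = 5) %>%
  mutate(fit = map(splits, cvfun_mse)) %>%
  unnest(fit) %>%
  group_by(id) %>%
  summarise(mse = mean(mse))  # average hold-out MSE per repeat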
Further, I'm not sure what probative value these numbers have. Generally you would use this sort of analysis to tune the model specification, often by finding nearly optimal values of hyper-parameters. You could imagine doing that here with different model specifications: for example, evaluating the relative predictive power of a model with a second-degree polynomial in Assault versus one that is linear, as below.
# same as cvfun, but with a second-degree polynomial in Assault
cvfun2 <- function(split, ...){
  mod <- glm(Murder01 ~ poly(Assault, 2), data = analysis(split), family = binomial)
  fit <- predict(mod, newdata = assessment(split), type = "response")
  data.frame(fit = fit,
             y = model.response(model.frame(formula(mod), data = assessment(split))))
}
cv_out2 <- vfold_cv(USArrests, v=10, repeats = 5) %>%
mutate(fit = map(splits, cvfun2)) %>%
unnest(fit) %>%
group_by(id) %>%
summarise(auc = roc(y, fit, plot=FALSE)$auc[1])
mean(cv_out2$auc)
# [1] 0.9123994
mean(cv_out$auc)
# [1] 0.9320451
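One caveat about the comparison above: each call to vfold_cv() draws new random folds, so the two AUCs are computed on different splits. If you wanted an apples-to-apples comparison, you could build the folds once and run both model functions over the same object (auc_by_repeat is just an illustrative helper):
set.seed(2021)
folds <- vfold_cv(USArrests, v = 10, repeats = 5)  # shared folds for both models
auc_by_repeat <- function(cv_fn){
  folds %>%
    mutate(fit = map(splits, cv_fn)) %>%
    unnest(fit) %>%
    group_by(id) %>%
    summarise(auc = roc(y, fit, plot = FALSE)$auc[1])
}
mean(auc_by_repeat(cvfun)$auc)   # linear
mean(auc_by_repeat(cvfun2)$auc)  # quadratic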
cv_out_plot <- vfold_cv(USArrests, v = 10, repeats = 5) %>%
  mutate(fit = map(splits, cvfun)) %>%
  unnest(fit) %>%
  group_by(id) %>%
  # one row per ROC threshold within each repeat
  summarise(sens = roc(y, fit, plot = FALSE)$sensitivities,
            spec = roc(y, fit, plot = FALSE)$specificities,
            obs = seq_along(sens))
ave <- cv_out_plot %>%
  ungroup() %>%
  group_by(obs) %>%  # average across repeats at each threshold index
  summarise(sens = mean(sens),
            spec = mean(spec),
            id = "Average")
cv_out_plot <- bind_rows(cv_out_plot, ave) %>%
mutate(col = factor(ifelse(id == "Average", "Average", "Individual"),
levels=c("Individual", "Average")))
ggplot(cv_out_plot, aes(x = 1 - spec, y = sens, group = id, colour = col)) +
  geom_line(aes(size = col, alpha = col)) +
  scale_colour_manual(values = c("black", "red")) +
  scale_size_manual(values = c(.5, 1.25)) +
  scale_alpha_manual(values = c(.3, 1)) +
  theme_classic() +
  theme(legend.position = c(.75, .15)) +
  labs(x = "1-Specificity", y = "Sensitivity", colour = "", alpha = "", size = "")
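Alternatively, since the question mentions the cvAUC package: if I remember its interface correctly, cvAUC() will produce essentially the target figure directly from the pooled hold-out predictions, labels, and fold ids (the $perf element is a ROCR performance object):
library(cvAUC)
# hold-out predictions and 0/1 labels from a single 10-fold pass, with fold ids
one_rep <- vfold_cv(USArrests, v = 10) %>%
  mutate(fit = map(splits, cvfun)) %>%
  unnest(fit)
out <- cvAUC(predictions = one_rep$fit, labels = one_rep$y, folds = one_rep$id)
plot(out$perf, col = "grey82", lty = 3)                    # per-fold ROC curves
plot(out$perf, col = "red", avg = "vertical", add = TRUE)  # mean ROC curve
out$cvAUC                                                  # cross-validated AUC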
Upvotes: 1