Building a nested logistic regression model using caret, glmnet and a (nested) cross-validation

Question

My problem

I would like to build a logistic regression model with a high AUC in predicting a binary variable.

I would like to use the following approach (if feasible):

Use an elastic net model (glmnet) to reduce the predictors and find the best hyperparameter (alpha and lambda)
Combine the output of this model (a simple linear combination) with an additional predictor (the opinion of a super doctor superdoc) in a logistic regression model (=finalmodel), similar as described on page 26 in:

Afshar P, Mohammadi A, Plataniotis KN, Oikonomou A, Benali H. From Handcrafted to Deep-Learning-Based Cancer Radiomics: Challenges and Opportunities. IEEE Signal Process Mag 2019; 36: 132–60. Available here

Example data

As example data I have a dataset with many numeric predictors and a binary (pos/neg) outcome (diabetes).

# library
library(tidyverse)
library(caret)
library(glmnet)
library(mlbench)

# get example data
data(PimaIndiansDiabetes, package="mlbench")
data <- PimaIndiansDiabetes

# add the super doctors opinion to the data
set.seed(2323)
data %>% 
  rowwise() %>% 
  mutate(superdoc=case_when(diabetes=="pos" ~ as.numeric(sample(0:2,1)), TRUE~ 0)) -> data

# separate the data in a training set and test set
train.data <- data[1:550,]
test.data <- data[551:768,]

^{Created on 2021-03-14 by the reprex package (v1.0.0)}

What I already tried

# train the model (without the superdoc's opinion)
set.seed(2323)
model <- train(
  diabetes ~., data = train.data %>% select(-superdoc), method = "glmnet",
  trControl = trainControl("cv",
                           number = 10,
                           classProbs = TRUE,
                           savePredictions = TRUE,
                           summaryFunction = twoClassSummary),
  tuneLength = 10,
  metric="ROC" #ROC metric is in twoClassSummary
)


# extract the coefficients for the best alpha and lambda  
coef(model$finalModel, model$finalModel$lambdaOpt) -> coeffs
tidy(coeffs) %>% tibble() -> coeffs

coef.interc = coeffs %>% filter(row=="(Intercept)") %>% pull(value)
coef.pregnant = coeffs %>% filter(row=="pregnant") %>% pull(value)
coef.glucose = coeffs %>% filter(row=="glucose") %>% pull(value)
coef.pressure = coeffs %>% filter(row=="pressure") %>% pull(value)
coef.mass = coeffs %>% filter(row=="mass") %>% pull(value)
coef.pedigree = coeffs %>% filter(row=="pedigree") %>% pull(value)
coef.age = coeffs %>% filter(row=="age") %>% pull(value)


# combine the model with the superdoc's opinion in a logistic regression model
finalmodel = glm(diabetes ~ superdoc + I(coef.interc + coef.pregnant*pregnant + coef.glucose*glucose + coef.pressure*pressure + coef.mass*mass + coef.pedigree*pedigree + coef.age*age),family=binomial, data=train.data)


# make predictions on the test data
predict(finalmodel,test.data, type="response") -> predictions


# check the AUC of the model in the test data
roc(test.data$diabetes,predictions, ci=TRUE) 
#> Setting levels: control = neg, case = pos
#> Setting direction: controls < cases
#> 
#> Call:
#> roc.default(response = test.data$diabetes, predictor = predictions,     ci = TRUE)
#> 
#> Data: predictions in 145 controls (test.data$diabetes neg) < 73 cases (test.data$diabetes pos).
#> Area under the curve: 0.9345
#> 95% CI: 0.8969-0.9721 (DeLong)

^{Created on 2021-03-14 by the reprex package (v1.0.0)}

Where I am not really sure...

I think to find the most accurate model and to avoid overfitting, I have to use a nested cross-validation (as I learned here and here). However, I am not sure how to do so. At the moment everytime I use another set.seed different predictors get selected and I get different AUCs. Can this be migitated with proper use of nested cross validation?

Update 1

I just learned that nested CV does not help you get the most accurate model. The problem is, that I get variating coefficients with different set.seet in the second code sample above. I have actually the same problem as described here: Extract the coefficients for the best tuning parameters of a glmnet model in caret

One posted solution is to use repeated CV to migitate this variation. Unfortunately I could not get this run.

Update 2

Using "repeatedcv" solved my problem. Using repeated cv not nested cv did the trick!

model <- train(
  diabetes ~., data = train.data %>% select(-superdoc), method = "glmnet",
  trControl = trainControl("repeatedcv",
                           number = 10,
                           repeats=10,
                           classProbs = TRUE,
                           savePredictions = TRUE,
                           summaryFunction = twoClassSummary),
  tuneLength = 10,
  metric="ROC" #ROC metric is in twoClassSummary
)

ava · Accepted Answer

Thanks to @missuse I could solve my questions:

Cross-validation does not help get the most accurate model. This (resp. my) misunderstanding is discussed beautifully in the blog post: The "Cross-Validation - Train/Predict" misunderstanding

The problem with seed depending variations of predictor's coefficients of glmnet in small datasets can be migitated with repeated cross-validation (ie, "repeatedcv" in caret::trainControl as described in the comments here)

Stacked learners (in my case stacked glmnet and glm) are usually built using out of fold predictions from lower level lerners. This could be done using the mlr3 package as described in this blog post: Tuning a stacked learner. Since this was not an initial question, I opened a new question here.

Building a nested logistic regression model using caret, glmnet and a (nested) cross-validation

Answers (1)

Related Questions