Reputation: 333
I use caret for ML-classification of a binary variable in a set of 100 patients. Since this variable is imbalanced (13 / 87 samples in each group) I perform subsampling with SMOTE and ROSE.
The median ROC (AUC) values across resamples for the svmRadial models are 62.5% without subsampling, 76.4% with ROSE and 77.8% with SMOTE (the means are 65.7%, 77.6% and 78.5%). If I look at the accuracy of the held-out predictions from 3-times repeated 10-fold CV, I get the best result without subsampling (87%), whereas SMOTE and ROSE perform much worse (71% and 39%).
Could someone explain to me why a higher ROC for SMOTE and ROSE translates into a lower accuracy in the held-out predictions? I would also have expected SMOTE and ROSE to alter the number of samples and the class distribution in the held-out predictions. However, when I look at my confusion matrices, the total number of samples is always n=300 (without subsampling, but also with SMOTE and ROSE).
Don't worry too much about the poor accuracy of the classifier (it should just serve as an example to illustrate my questions...)
Thanks for your help,
Philipp
library(caret)

# chosen_train: my data frame with the factor outcome Class (levels MAIN / OTHER)
my_method <- "svmRadial"
ctrl <- trainControl(method = "repeatedcv", repeats = 3, classProbs = TRUE,
                     summaryFunction = twoClassSummary, savePredictions = "final")

# No subsampling
set.seed(1)
orig_fit <- train(Class ~ ., data = chosen_train,
                  method = my_method,
                  trControl = ctrl, metric = "ROC",
                  preProc = c("center", "scale"), verbose = FALSE)

# ROSE applied inside each resample
ctrl$sampling <- "rose"
set.seed(1)
rose_inside <- train(Class ~ ., data = chosen_train,
                     method = my_method,
                     trControl = ctrl, metric = "ROC",
                     preProc = c("center", "scale"), verbose = FALSE)

# SMOTE applied inside each resample
ctrl$sampling <- "smote"
set.seed(1)
smote_inside <- train(Class ~ ., data = chosen_train,
                      method = my_method,
                      trControl = ctrl, metric = "ROC",
                      preProc = c("center", "scale"), verbose = FALSE)

# Collect the resampling results of the three fits for comparison
inside_models <- list(original = orig_fit, rose = rose_inside, smote = smote_inside)
set.seed(1)
inside_resampling <- resamples(inside_models)
> summary(inside_resampling, metric = "ROC")
           Min. 1st Qu. Median   Mean 3rd Qu. Max. NA's
original 0.4444  0.5556 0.6250 0.6569  0.7431    1    0
rose     0.3889  0.6667 0.7639 0.7757  0.8889    1    0
smote    0.4444  0.6667 0.7778 0.7845  0.8889    1    0
> confusionMatrix(rose_inside$pred$pred, rose_inside$pred$obs)
          Reference
Prediction MAIN OTHER
     MAIN    15   158
     OTHER   24   103
Accuracy : 0.3933
95% CI : (0.3377, 0.4511)
No Information Rate : 0.87
P-Value [Acc > NIR] : 1
Kappa : -0.0897
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.38462
Specificity : 0.39464
Pos Pred Value : 0.08671
Neg Pred Value : 0.81102
Prevalence : 0.13000
Detection Rate : 0.05000
Detection Prevalence : 0.57667
Balanced Accuracy : 0.38963
'Positive' Class : MAIN
> confusionMatrix(smote_inside$pred$pred, smote_inside$pred$obs)
Confusion Matrix and Statistics
          Reference
Prediction MAIN OTHER
     MAIN     6    55
     OTHER   33   206
Accuracy : 0.7067
95% CI : (0.6516, 0.7576)
No Information Rate : 0.87
P-Value [Acc > NIR] : 1.00000
Kappa : -0.0459
Mcnemar's Test P-Value : 0.02518
Sensitivity : 0.15385
Specificity : 0.78927
Pos Pred Value : 0.09836
Neg Pred Value : 0.86192
Prevalence : 0.13000
Detection Rate : 0.02000
Detection Prevalence : 0.20333
Balanced Accuracy : 0.47156
'Positive' Class : MAIN
> confusionMatrix(orig_fit$pred$pred, orig_fit$pred$obs)
Confusion Matrix and Statistics
          Reference
Prediction MAIN OTHER
     MAIN     0     0
     OTHER   39   261
Accuracy : 0.87
95% CI : (0.8266, 0.9059)
No Information Rate : 0.87
P-Value [Acc > NIR] : 0.5426
Kappa : 0
Mcnemar's Test P-Value : 1.166e-09
Sensitivity : 0.00
Specificity : 1.00
Pos Pred Value : NaN
Neg Pred Value : 0.87
Prevalence : 0.13
Detection Rate : 0.00
Detection Prevalence : 0.00
Balanced Accuracy : 0.50
'Positive' Class : MAIN
Upvotes: 0
Views: 1333
Reputation: 116
The 87% accuracy without subsampling doesn't mean much: that model predicts OTHER for every sample, so its accuracy is exactly your problem's No Information Rate (87/100).
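To make the comparison with the No Information Rate concrete, here is a small base-R sketch that recomputes accuracy, the No Information Rate and balanced accuracy from the counts in the three confusion matrices posted in the question. The helper names `make_cm` and `summarise_cm` are made up for this illustration and are not part of caret.

```r
# Counts copied from the confusion matrices in the question
# (rows = prediction, columns = reference, positive class = MAIN).
# make_cm() and summarise_cm() are hypothetical helpers, not caret functions.
make_cm <- function(tp, fp, fn, tn) {
  matrix(c(tp, fn, fp, tn), nrow = 2,
         dimnames = list(Prediction = c("MAIN", "OTHER"),
                         Reference  = c("MAIN", "OTHER")))
}

cms <- list(
  original = make_cm(tp = 0,  fp = 0,   fn = 39, tn = 261),
  rose     = make_cm(tp = 15, fp = 158, fn = 24, tn = 103),
  smote    = make_cm(tp = 6,  fp = 55,  fn = 33, tn = 206)
)

summarise_cm <- function(cm) {
  n    <- sum(cm)
  sens <- cm["MAIN", "MAIN"]   / sum(cm[, "MAIN"])   # recall on MAIN
  spec <- cm["OTHER", "OTHER"] / sum(cm[, "OTHER"])  # recall on OTHER
  c(accuracy            = sum(diag(cm)) / n,
    no_information_rate = max(colSums(cm)) / n,      # always guess the majority class
    balanced_accuracy   = (sens + spec) / 2)
}

results <- round(t(sapply(cms, summarise_cm)), 4)
print(results)
```

The unsampled model only matches the majority-class baseline (0.87), and its balanced accuracy is 0.5, so the accuracy ranking of the three models says little about how well they separate the classes.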
"higher ROC for SMOTE and ROSE translates into a lower accuracy in the held-out-predictions"- I don't think it is a general and correct observation.
Upvotes: 1