user86533

Reputation: 333

Caret - Subsampling for Class Imbalances

I use caret for ML-classification of a binary variable in a set of 100 patients. Since this variable is imbalanced (13 / 87 samples in each group) I perform subsampling with SMOTE and ROSE.

The median ROC AUC values of the different classification models with svmRadial are: 62.5% without subsampling, 76.4% with ROSE and 77.8% with SMOTE. If I look at the accuracy of the held-out predictions from 10-fold CV repeated 3 times, I get the best result without subsampling (87%), whereas SMOTE and ROSE perform much worse (71% and 39%).

Could someone explain to me why a higher ROC for SMOTE and ROSE translates into a lower accuracy in the held-out predictions? Also, I would have expected SMOTE and ROSE to alter the number of samples as well as the class distribution in the held-out predictions too. However, when I look at my confusion matrices, the total number of samples is always n=300 (without subsampling, but also with SMOTE and ROSE).

Please don't worry too much about the poor accuracy of the classifier (it just serves as an example to illustrate my questions).

Thanks for your help,

Philipp

library(caret)

my_method <- "svmRadial"
# Repeated CV (caret's default of 10 folds, repeated 3 times); keep the held-out predictions for the best tuning parameters
ctrl <- trainControl(method = "repeatedcv", repeats = 3, classProbs = TRUE,
                     summaryFunction = twoClassSummary, savePredictions = "final")
set.seed(1)
orig_fit <- train(Class ~ ., data = chosen_train,
                  method = my_method,
                  trControl = ctrl, metric = "ROC", preProc = c("center", "scale"), verbose = FALSE)

ctrl$sampling <- "rose"
set.seed(1)
rose_inside <- train(Class ~ ., data = chosen_train,
                     method = my_method,
                     trControl = ctrl, metric="ROC", preProc = c("center", "scale"),verbose=F)

ctrl$sampling <- "smote"
set.seed(1)
smote_inside <- train(Class ~ ., data = chosen_train,
                     method = my_method,
                     trControl = ctrl, metric="ROC", preProc = c("center", "scale"),verbose=F)

inside_models <- list(original = orig_fit, rose = rose_inside, smote=smote_inside)
set.seed(1)
inside_resampling <- resamples(inside_models)
> summary(inside_resampling, metric = "ROC")

           Min. 1st Qu. Median   Mean 3rd Qu. Max. NA's
original 0.4444  0.5556 0.6250 0.6569  0.7431    1    0
rose     0.3889  0.6667 0.7639 0.7757  0.8889    1    0
smote    0.4444  0.6667 0.7778 0.7845  0.8889    1    0


> confusionMatrix(rose_inside$pred$pred,rose_inside$pred$obs)

          Reference
Prediction MAIN OTHER
  MAIN       15   158
  OTHER      24   103

               Accuracy : 0.3933          
                 95% CI : (0.3377, 0.4511)
    No Information Rate : 0.87            
    P-Value [Acc > NIR] : 1               

                  Kappa : -0.0897         
 Mcnemar's Test P-Value : <2e-16          

            Sensitivity : 0.38462         
            Specificity : 0.39464         
         Pos Pred Value : 0.08671         
         Neg Pred Value : 0.81102         
             Prevalence : 0.13000         
         Detection Rate : 0.05000         
   Detection Prevalence : 0.57667         
      Balanced Accuracy : 0.38963         

       'Positive' Class : MAIN        

> confusionMatrix(smote_inside$pred$pred,smote_inside$pred$obs)
Confusion Matrix and Statistics

          Reference
Prediction MAIN OTHER
  MAIN        6    55
  OTHER      33   206

               Accuracy : 0.7067          
                 95% CI : (0.6516, 0.7576)
    No Information Rate : 0.87            
    P-Value [Acc > NIR] : 1.00000         

                  Kappa : -0.0459         
 Mcnemar's Test P-Value : 0.02518         

            Sensitivity : 0.15385         
            Specificity : 0.78927         
         Pos Pred Value : 0.09836         
         Neg Pred Value : 0.86192         
             Prevalence : 0.13000         
         Detection Rate : 0.02000         
   Detection Prevalence : 0.20333         
      Balanced Accuracy : 0.47156         

       'Positive' Class : MAIN        

> confusionMatrix(orig_fit$pred$pred,orig_fit$pred$obs)
Confusion Matrix and Statistics

          Reference
Prediction MAIN OTHER
  MAIN        0     0
  OTHER      39   261

               Accuracy : 0.87            
                 95% CI : (0.8266, 0.9059)
    No Information Rate : 0.87            
    P-Value [Acc > NIR] : 0.5426          

                  Kappa : 0               
 Mcnemar's Test P-Value : 1.166e-09       

            Sensitivity : 0.00            
            Specificity : 1.00            
         Pos Pred Value :  NaN            
         Neg Pred Value : 0.87            
             Prevalence : 0.13            
         Detection Rate : 0.00            
   Detection Prevalence : 0.00            
      Balanced Accuracy : 0.50            

       'Positive' Class : MAIN 

Upvotes: 0

Views: 1333

Answers (1)

Dana Averbuch

Reputation: 116

The accuracy of the model without subsampling doesn't mean much, since it is the same as your problem's No Information Rate (87/100): that model predicts OTHER for every held-out sample.
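To make that comparison concrete, here is a small base-R sketch that recomputes accuracy, the No Information Rate and balanced accuracy from the held-out confusion matrix counts posted in the question. The helper names `make_cm` and `summarise_cm` are my own, not part of caret.

```r
# Held-out confusion matrix counts copied from the question.
# Values are given column by column: first the Reference = MAIN column,
# then the Reference = OTHER column (rows are the predictions).
make_cm <- function(counts) {
  matrix(counts, nrow = 2,
         dimnames = list(Prediction = c("MAIN", "OTHER"),
                         Reference  = c("MAIN", "OTHER")))
}

cms <- list(
  original = make_cm(c(0, 39, 0, 261)),
  rose     = make_cm(c(15, 24, 158, 103)),
  smote    = make_cm(c(6, 33, 55, 206))
)

summarise_cm <- function(cm) {
  sens <- cm["MAIN", "MAIN"] / sum(cm[, "MAIN"])     # positive class is MAIN
  spec <- cm["OTHER", "OTHER"] / sum(cm[, "OTHER"])
  c(accuracy            = sum(diag(cm)) / sum(cm),
    no_information_rate = max(colSums(cm)) / sum(cm), # always guess the majority class
    balanced_accuracy   = (sens + spec) / 2)
}

stats <- sapply(cms, summarise_cm)
print(round(stats, 4))
```

None of the three models beats the No Information Rate of 0.87, and the 0.87 of the unsampled model comes entirely from predicting the majority class, which is why balanced accuracy is the more informative number here.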

"higher ROC for SMOTE and ROSE translates into a lower accuracy in the held-out-predictions": I don't think this is a correct observation in general.
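The two metrics can still move in opposite directions, because ROC AUC only depends on how the predicted probabilities rank the samples, while accuracy depends on where they fall relative to the 0.5 cutoff. The sketch below uses made-up probabilities (not taken from your models) and a rank-based AUC helper of my own, `auc_rank`, to show one way this can happen. Treat it as an illustration under those assumptions, not as a diagnosis of your fits.

```r
# Toy data, invented for illustration only: 2 MAIN and 8 OTHER samples.
obs <- factor(c(rep("MAIN", 2), rep("OTHER", 8)), levels = c("MAIN", "OTHER"))

# Hypothetical P(MAIN) from a model trained on imbalanced data:
# poor ranking, and every probability stays below 0.5.
p_unsampled <- c(0.10, 0.30,
                 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40)

# Hypothetical P(MAIN) from a model trained on balanced data:
# better ranking, but probabilities are shifted toward MAIN.
p_balanced <- c(0.90, 0.70,
                0.20, 0.30, 0.40, 0.55, 0.60, 0.65, 0.75, 0.80)

# Rank-based AUC (Mann-Whitney U statistic; ties count as 0.5).
auc_rank <- function(score, obs, positive = "MAIN") {
  is_pos <- obs == positive
  n_pos  <- sum(is_pos)
  n_neg  <- sum(!is_pos)
  (sum(rank(score)[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Accuracy when predicting MAIN for P(MAIN) > cutoff.
accuracy_at <- function(score, obs, cutoff = 0.5) {
  pred <- ifelse(score > cutoff, "MAIN", "OTHER")
  mean(pred == as.character(obs))
}

result <- rbind(
  unsampled = c(auc = auc_rank(p_unsampled, obs),
                accuracy = accuracy_at(p_unsampled, obs)),
  balanced  = c(auc = auc_rank(p_balanced, obs),
                accuracy = accuracy_at(p_balanced, obs))
)
print(result)
```

With these invented numbers the balanced model has the higher AUC (0.875 versus 0.4375) but the lower accuracy at the 0.5 cutoff (0.5 versus 0.8), because several OTHER samples are pushed above the cutoff.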

Upvotes: 1
