Confusion matrix for random forest in R Caret

I have data with a binary YES/NO Class response. I am using the following code to run an RF model, but I have a problem getting the confusion matrix.

 dataR <- read_excel("*:/*.xlsx")
 Train    <- createDataPartition(dataR$Class, p=0.7, list=FALSE)  
 training <- dataR[ Train, ]
 testing  <- dataR[ -Train, ]

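model_rf <- train(Class ~ ., tuneLength = 3, data = training, method = "rf",
                  importance = TRUE,
                  # the call was cut off here; the remaining arguments are inferred from the printed summary below
                  trControl = trainControl(method = "cv", number = 5,
                                           sampling = "smote"))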


Random Forest 

3006 samples
82 predictor
2 classes: 'NO', 'YES' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 2405, 2406, 2405, 2404, 2404 
Addtional sampling using SMOTE

Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   2    0.7870921  0.2750655
  44    0.7787721  0.2419762
  87    0.7767760  0.2524898

Accuracy was used to select the optimal model using  the largest value.
The final value used for the model was mtry = 2.

So far fine, but when I run this code:

# Apply threshold of 0.50: p_class
class_log <- ifelse(model_rf[,1] > 0.50, "YES", "NO")

# Create confusion matrix
p <-confusionMatrix(class_log, testing[["Class"]])

##gives the accuracy

I get this error:

 Error in model_rf[, 1] : incorrect number of dimensions

I would appreciate any help getting the confusion matrix.

As I understand it, you would like to obtain the confusion matrix for the cross-validation predictions in caret.

For this you need to specify savePredictions in trainControl. If it is set to "final", the predictions for the best model are saved. With classProbs = TRUE, the probabilities for each class are saved as well.

library(caret)  # the "rf" method also requires the randomForest package

iris_2 <- iris[iris$Species != "setosa", ]  # make a two-class problem
iris_2$Species <- factor(iris_2$Species)    # drop the unused level

model_rf <- train(Species ~ ., tuneLength = 3, data = iris_2, method = "rf",
                  importance = TRUE,
                  trControl = trainControl(method = "cv",
                                           number = 5,
                                           savePredictions = "final",
                                           classProbs = TRUE))

Predictions are in:

model_rf$pred

These are sorted by CV fold. To sort them as in the original data frame:

model_rf$pred[order(model_rf$pred$rowIndex),2]

To obtain a confusion matrix:

confusionMatrix(model_rf$pred[order(model_rf$pred$rowIndex),2], iris_2$Species)
Confusion Matrix and Statistics

Prediction   versicolor virginica
  versicolor         46         6
  virginica           4        44

               Accuracy : 0.9            
                 95% CI : (0.8238, 0.951)
    No Information Rate : 0.5            
    P-Value [Acc > NIR] : <2e-16         

                  Kappa : 0.8            
 Mcnemar's Test P-Value : 0.7518         

            Sensitivity : 0.9200         
            Specificity : 0.8800         
         Pos Pred Value : 0.8846         
         Neg Pred Value : 0.9167         
             Prevalence : 0.5000         
         Detection Rate : 0.4600         
   Detection Prevalence : 0.5200         
      Balanced Accuracy : 0.9000         

       'Positive' Class : versicolor 

In a two-class setting, 0.5 is often a sub-optimal threshold probability. A better threshold can be found after training by optimizing Kappa, Youden's J statistic, or any other preferred metric as a function of the threshold. Here is an example:

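sapply(1:40/40, function(x){
  versicolor <- model_rf$pred[order(model_rf$pred$rowIndex),4]
  class <- ifelse(versicolor >= x, "versicolor", "virginica")
  # confusionMatrix() expects factors with the same levels as the reference
  class <- factor(class, levels = levels(iris_2$Species))
  mat <- confusionMatrix(class, iris_2$Species)
  kappa <- mat$overall[2]
  res <- data.frame(prob = x, kappa = kappa)
  return(res)
})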

Here the highest kappa is not obtained at threshold == 0.5 but at 0.1. This should be used carefully because it can lead to over-fitting.
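
The same kind of scan can be written for Youden's J (sensitivity + specificity - 1). Below is a minimal base R sketch, not part of caret: the function name youden_scan, its arguments, and the toy values are all made up for illustration.

```r
# Scan cut-offs and compute Youden's J = sensitivity + specificity - 1.
# prob     : numeric vector, predicted probability of the positive class
# obs      : factor (or character vector) of observed classes
# positive : name of the positive class (hypothetical argument name)
youden_scan <- function(prob, obs, positive,
                        thresholds = seq(0.05, 0.95, by = 0.05)) {
  is_pos <- obs == positive
  res <- lapply(thresholds, function(th) {
    pred_pos <- prob >= th
    sens <- sum(pred_pos & is_pos) / sum(is_pos)      # true positive rate
    spec <- sum(!pred_pos & !is_pos) / sum(!is_pos)   # true negative rate
    data.frame(threshold = th, sensitivity = sens, specificity = spec,
               J = sens + spec - 1)
  })
  do.call(rbind, res)
}

# Toy, made-up probabilities and labels, only to show the call
prob <- c(0.93, 0.82, 0.67, 0.37, 0.62, 0.42, 0.22, 0.12)
obs  <- factor(c("YES", "YES", "YES", "YES", "NO", "NO", "NO", "NO"))

scan <- youden_scan(prob, obs, positive = "YES")
best <- scan[which.max(scan$J), ]   # row with the largest J
best
```

With the model above, and assuming savePredictions and classProbs were set as shown, the call would presumably be youden_scan(model_rf$pred$versicolor, model_rf$pred$obs, positive = "versicolor"). The same over-fitting caveat applies, since the threshold is tuned on the same resampled predictions.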

You need to apply your model to the test set.

prediction.rf <- predict(model_rf, testing, type = "prob")

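predict with type = "prob" returns a data frame with one probability column per class, so threshold the "YES" column:

class_log <- ifelse(prediction.rf[, "YES"] > 0.50, "YES", "NO")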
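
Putting the test-set steps together, here is a rough end-to-end sketch. It uses the two-class iris subset because the original Excel data is not available; the seed, the object names, and the 0.50 cut-off are arbitrary choices, and it assumes the caret and randomForest packages are installed.

```r
library(caret)  # method = "rf" also needs the randomForest package

set.seed(42)  # arbitrary seed, only for reproducibility

# Two-class stand-in data set (assumption: replaces the original dataR)
iris_2 <- iris[iris$Species != "setosa", ]
iris_2$Species <- factor(iris_2$Species)

# Hold out 30% of the rows as a test set
idx      <- createDataPartition(iris_2$Species, p = 0.7, list = FALSE)
training <- iris_2[idx, ]
testing  <- iris_2[-idx, ]

fit <- train(Species ~ ., data = training, method = "rf", tuneLength = 3,
             trControl = trainControl(method = "cv", number = 5,
                                      classProbs = TRUE))

# One probability column per class, named after the factor levels
probs <- predict(fit, newdata = testing, type = "prob")

# Threshold one class column, then convert to a factor that shares
# its levels with the reference, as confusionMatrix() expects
pred_class <- factor(ifelse(probs[, "versicolor"] > 0.50,
                            "versicolor", "virginica"),
                     levels = levels(testing$Species))

cm <- confusionMatrix(data = pred_class, reference = testing$Species)
cm$table
cm$overall["Accuracy"]
```

For the data in the question, the Class column read by read_excel is probably a character vector, so it would likely need factor(testing[["Class"]]) before being passed as the reference.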

The code piece class_log <- ifelse(model_rf[,1] > 0.50, "YES", "NO") is an if-else statement that performs the following test:

In the first column of model_rf, if the number is greater than 0.50, return "YES", else return "NO", and save the results in object class_log.

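So the code is meant to create a character vector of class labels, "YES" and "NO", from a numeric vector. Because model_rf is a train object rather than a matrix or data frame, indexing it with [,1] fails with the "incorrect number of dimensions" error.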

You can try this to create the confusion matrix and check the accuracy:

m <- table(class_log, testing[["Class"]])
m   #confusion table
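
The accuracy itself can be read off the table as the share of cases on the diagonal. A small sketch with made-up labels (the vectors below are illustrative, not the poster's data):

```r
# Made-up labels, purely for illustration; giving both factors the same
# levels keeps the table square with matching row/column order
predicted <- factor(c("YES", "NO", "YES", "NO", "NO"), levels = c("NO", "YES"))
observed  <- factor(c("YES", "NO", "NO",  "NO", "YES"), levels = c("NO", "YES"))

m <- table(Predicted = predicted, Observed = observed)
m

# Accuracy = correctly classified cases (the diagonal) / all cases
accuracy <- sum(diag(m)) / sum(m)
accuracy
```

If class_log is a character vector that happens to contain only one of the two labels, the table is no longer square and diag() no longer lines up with the correct cells, so converting both inputs to factors with identical levels first is the safer route.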


