Jogi

Reputation: 314

Alternative performance measures for multiclass classification in caret

I want to tune a classification algorithm that predicts probabilities using caret. Since my data set is highly imbalanced, caret's default Accuracy metric does not seem very helpful, according to this post: https://stats.stackexchange.com/questions/68702/r-caret-difference-between-roc-curve-and-accuracy-for-classification.

In my specific case, I want to determine the optimal mtry parameter of a random forest that predicts probabilities. I have 3 classes with a class balance of 98.7% / 0.45% / 0.85%. A reproducible example (which, sadly, does not have an imbalanced data set) is given by:

library(caret)
data(iris)

control <- trainControl(method = "cv", number = 5, verboseIter = TRUE, classProbs = TRUE)

grid <- expand.grid(mtry = 1:3)
rf_gridsearch <- train(y = iris[, 5], x = iris[-5], method = "ranger",
                       num.trees = 2000, tuneGrid = grid, trControl = control)
rf_gridsearch

So my two questions basically are:

  1. What alternative summary metrics besides Accuracy do I have? (Using multiclass ROC is not my favourite, due to: https://stats.stackexchange.com/questions/68702/r-caret-difference-between-roc-curve-and-accuracy-for-classification; I am thinking of something like a Brier score.)
  2. How do I implement them?

Many thanks!

Upvotes: 1

Views: 810

Answers (1)

user7677771

Reputation: 69

I use the Matthews correlation coefficient (MCC) when dealing with imbalanced data sets. As an example, using rf from the randomForest package instead of ranger, together with the mltools package:

library(mltools)
library(caret)

# Summary function that returns the MCC for caret's model selection
optimize_mcc <- function(data, lev = NULL, model = NULL) {
  mcc_value <- mcc(preds = data$pred, actuals = data$obs)
  c(MCC = mcc_value)
}
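As a quick sanity check (a toy example, not part of the original answer): mcc() can be called directly on factor vectors, which is exactly how optimize_mcc uses it above.

```r
library(mltools)

# Toy check: mcc() accepts factor vectors of predictions and observations
preds  <- factor(c("a", "a", "b", "c"), levels = c("a", "b", "c"))
actual <- factor(c("a", "b", "b", "c"), levels = c("a", "b", "c"))
mcc(preds = preds, actuals = actual)
```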

fit <-
  train(
    x = iris[-5],
    y = iris[, 5],
    method = "rf",
    metric = "MCC",
    trControl = trainControl(
      summaryFunction = optimize_mcc,
      method = "cv",
      number = 5,
      savePredictions = TRUE,
      classProbs = TRUE),
    tuneGrid = expand.grid(mtry = 1:3))

Printing fit will then return:

Random Forest 

150 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 120, 120, 120, 120, 120 
Resampling results across tuning parameters:

  mtry  MCC      
  1     0.9109190
  2     0.9212364
  3     0.9309524

MCC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 3.
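Regarding the Brier score mentioned in question 1: caret does not ship a multiclass Brier summary function, but one can be sketched along the same lines as optimize_mcc above (a sketch, assuming classProbs = TRUE so that data carries one probability column per class; brier_summary is a made-up name):

```r
library(caret)

# Sketch of a multiclass Brier score summary function (lower is better).
# Assumes classProbs = TRUE, so `data` has one probability column per class.
brier_summary <- function(data, lev = NULL, model = NULL) {
  obs_onehot <- model.matrix(~ obs - 1, data = data)  # one-hot encode observed classes
  colnames(obs_onehot) <- lev
  probs <- as.matrix(data[, lev])                     # predicted class probabilities
  c(Brier = mean(rowSums((probs - obs_onehot)^2)))
}
```

Since a lower Brier score is better, pass metric = "Brier" together with maximize = FALSE to train() so caret selects the smallest value.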

Upvotes: 0
