Reputation: 314
I want to tune a classification algorithm that predicts probabilities using caret.
Since my data set is highly unbalanced, caret's default Accuracy metric does not seem very helpful, according to this post: https://stats.stackexchange.com/questions/68702/r-caret-difference-between-roc-curve-and-accuracy-for-classification.
In my specific case, I want to determine the optimal mtry parameter of a random forest that predicts probabilities. I have 3 classes with a balance ratio of 98.7% - 0.45% - 0.85%. A reproducible example (which sadly does not use an unbalanced data set) is given by:
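library(caret)
data(iris)

# 5-fold cross-validation; classProbs = TRUE makes class probabilities available
control <- trainControl(method = "cv", number = 5, verboseIter = TRUE, classProbs = TRUE)

# Candidate mtry values.
# Note: newer caret versions may also expect splitrule and min.node.size columns for ranger.
grid <- expand.grid(mtry = 1:3)

rf_gridsearch <- train(
  x = iris[-5],
  y = iris[, 5],
  method = "ranger",
  num.trees = 2000,
  tuneGrid = grid,
  trControl = control
)
rf_gridsearch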
So my two questions basically are:
Which alternatives to Accuracy do I have?
(Using multiROC is not my favourite, due to: https://stats.stackexchange.com/questions/68702/r-caret-difference-between-roc-curve-and-accuracy-for-classification. I am thinking of something like a Brier score.)
Many thanks!
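To make the Brier score idea concrete, here is a minimal sketch of a multiclass Brier score written in the shape caret expects from a summaryFunction. It is only an illustration under assumptions: the name brier_summary and the toy data are made up, and it relies on caret passing a data frame with an obs column plus one probability column per class level when classProbs = TRUE.

```r
# Multiclass Brier score in the shape of a caret summaryFunction.
# Assumes `data` has a factor column `obs` and one probability column per
# class level named in `lev` (what caret supplies when classProbs = TRUE).
# `brier_summary` is a hypothetical name, not part of caret.
brier_summary <- function(data, lev = NULL, model = NULL) {
  if (is.null(lev)) lev <- levels(data$obs)
  probs <- as.matrix(data[, lev, drop = FALSE])
  # One-hot encode the observed classes in the same column order as `lev`.
  onehot <- outer(as.character(data$obs), lev, FUN = "==") * 1
  # Mean over rows of the summed squared differences (lower is better).
  c(Brier = mean(rowSums((probs - onehot)^2)))
}

# Toy check with hand-made probabilities (hypothetical data, not from the post).
toy <- data.frame(
  obs = factor(c("a", "b", "c"), levels = c("a", "b", "c")),
  a = c(1, 0, 0.5),
  b = c(0, 1, 0.5),
  c = c(0, 0, 0)
)
brier_toy <- brier_summary(toy, lev = c("a", "b", "c"))
```

It should be usable via trainControl(summaryFunction = brier_summary, classProbs = TRUE) together with train(..., metric = "Brier", maximize = FALSE), since a smaller Brier score is better. caret also ships mnLogLoss, a built-in probability-based summary function, which may be an alternative.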
Upvotes: 1
Views: 810
Reputation: 69
I use the Matthews correlation coefficient (MCC) when dealing with imbalanced data sets. As an example, using rf from package randomForest instead of ranger, and package mltools:
library(mltools)
library(caret)

# Define function that will return the MCC.
# caret calls it on each resample with the observed (obs) and predicted (pred) classes.
optimize_mcc <- function(data, lev = NULL, model = NULL) {
  mcc_value <- mcc(preds = data$pred, actuals = data$obs)
  c(MCC = mcc_value)
}

fit <- train(
  x = iris[-5],
  y = iris[, 5],
  method = "rf",
  metric = "MCC",  # must match the name returned by optimize_mcc
  trControl = trainControl(
    summaryFunction = optimize_mcc,
    method = "cv",
    number = 5,
    savePredictions = TRUE,
    classProbs = TRUE
  ),
  tuneGrid = expand.grid(mtry = 1:3)
)
This returns the following for fit:
Random Forest
150 samples
4 predictor
3 classes: 'setosa', 'versicolor', 'virginica'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 120, 120, 120, 120, 120
Resampling results across tuning parameters:
  mtry  MCC
  1     0.9109190
  2     0.9212364
  3     0.9309524
MCC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 3.
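To show why MCC is more informative than Accuracy on unbalanced classes, here is a small base R sketch of the usual multiclass generalization of MCC computed from the confusion matrix. The function name mcc_multiclass and the toy vectors are hypothetical, and I am only assuming (not claiming) that mltools::mcc computes an equivalent quantity.

```r
# Multiclass MCC from a confusion matrix (base R only).
# `mcc_multiclass` is a hypothetical helper, not part of mltools or caret.
mcc_multiclass <- function(pred, obs) {
  lev <- union(levels(factor(obs)), levels(factor(pred)))
  cm <- table(factor(pred, levels = lev), factor(obs, levels = lev))
  s <- as.numeric(sum(cm))              # total number of samples
  correct <- as.numeric(sum(diag(cm)))  # correctly classified samples
  p <- as.numeric(rowSums(cm))          # predicted count per class
  t <- as.numeric(colSums(cm))          # observed count per class
  denom <- sqrt((s^2 - sum(p^2)) * (s^2 - sum(t^2)))
  if (denom == 0) return(0)             # convention when only one class is predicted or observed
  (correct * s - sum(p * t)) / denom
}

# Toy unbalanced example (made-up data): always predicting the majority class.
obs <- c(rep("a", 98), "b", "c")
pred <- rep("a", 100)
acc_majority <- mean(pred == obs)             # high accuracy despite a useless model
mcc_majority <- mcc_multiclass(pred, obs)     # MCC exposes the lack of skill
```

In this toy case the accuracy is 0.98 while the MCC is 0, which is the behaviour that makes MCC a reasonable tuning metric here.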
Upvotes: 0