Reputation: 1249
I was attempting to build a RandomForest model in caret following the steps here. Essentially, they set up the RandomForest, then search for the best mtry, then the best maxnodes, then the best number of trees. These steps make sense, but wouldn't it be better to search over the interaction of those three factors rather than tuning them one at a time?
Secondly, I understand how to perform a grid search for mtry and ntrees, but I do not know what to set the minimum or maximum number of nodes to. Is it generally advisable to leave nodesize at its default, as shown below?
library(randomForest)
library(caret)

mtrys  <- seq(1, 4, 1)
ntrees <- c(250, 300, 350, 400, 450, 500, 550, 600, 800, 1000, 2000)

combo_mtrTrees <- expand.grid(mtrys, ntrees)
colnames(combo_mtrTrees) <- c('mtrys', 'ntrees')

tuneGrid <- expand.grid(.mtry = c(1:4))

for (i in 1:length(ntrees)) {
  ntree <- ntrees[i]
  set.seed(65)
  rf_maxtrees <- train(Species ~ .,
                       data = df,
                       method = "rf",
                       importance = TRUE,
                       metric = "Accuracy",
                       tuneGrid = tuneGrid,
                       trControl = trainControl(method = "cv",
                                                number = 5,
                                                search = 'grid',
                                                classProbs = TRUE,
                                                savePredictions = "final"),
                       ntree = ntree)
  Acc1 <- rf_maxtrees$results$Accuracy[rf_maxtrees$results$mtry == 1]
  Acc2 <- rf_maxtrees$results$Accuracy[rf_maxtrees$results$mtry == 2]
  Acc3 <- rf_maxtrees$results$Accuracy[rf_maxtrees$results$mtry == 3]
  Acc4 <- rf_maxtrees$results$Accuracy[rf_maxtrees$results$mtry == 4]
  combo_mtrTrees$Acc[combo_mtrTrees$mtrys == 1 & combo_mtrTrees$ntrees == ntree] <- Acc1
  combo_mtrTrees$Acc[combo_mtrTrees$mtrys == 2 & combo_mtrTrees$ntrees == ntree] <- Acc2
  combo_mtrTrees$Acc[combo_mtrTrees$mtrys == 3 & combo_mtrTrees$ntrees == ntree] <- Acc3
  combo_mtrTrees$Acc[combo_mtrTrees$mtrys == 4 & combo_mtrTrees$ntrees == ntree] <- Acc4
}
Upvotes: 7
Views: 19698
Reputation: 19716
Yes, it would be better to search over the interactions of parameters.
nodesize and maxnodes are usually left at their defaults, but there is no reason not to tune them. Personally, I would leave maxnodes at its default and perhaps tune nodesize - it can be seen as a regularization parameter. To get an idea of what values to try, check the defaults in randomForest: nodesize is 1 for classification and 5 for regression, so trying 1-10 would be an option.
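For instance, a minimal sketch of a candidate grid (using the built-in iris data purely for illustration; nodesize is passed straight through to randomForest):

library(randomForest)

# candidate nodesize values around the classification default of 1;
# larger values stop splitting earlier and act like regularization
nodesizes <- 1:10

# nodesize is forwarded to randomForest(), e.g.
rf_fit <- randomForest(Species ~ ., data = iris, nodesize = 5)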
When performing tuning in a loop, as in your example, it is advisable to always use the same cross-validation folds. You can create them using createFolds prior to calling the loop.
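A minimal sketch of what that could look like for your data (assuming a data frame df with the outcome Species, as in your code):

library(caret)

set.seed(65)
cv_folds <- createFolds(df$Species, k = 5, returnTrain = TRUE)

# reuse the same folds in every iteration of the tuning loop
ctrl <- trainControl(method = "cv",
                     number = 5,
                     index = cv_folds)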
After tuning, be sure to evaluate your results on an independent validation set, or perform nested cross-validation in which the inner loop is used to tune parameters and the outer loop to estimate model performance, since the results from cross-validation alone will be optimistically biased.
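A rough sketch of nested cross-validation under the same assumptions (df with the outcome Species and four predictors; the inner caret CV tunes mtry, the outer loop estimates performance):

set.seed(65)
outer_folds <- createFolds(df$Species, k = 5, returnTrain = TRUE)
outer_acc   <- numeric(length(outer_folds))

for (k in seq_along(outer_folds)) {
  train_dat <- df[outer_folds[[k]], ]
  test_dat  <- df[-outer_folds[[k]], ]

  # inner CV: tune mtry on the outer training part only
  fit <- train(Species ~ .,
               data = train_dat,
               method = "rf",
               tuneGrid = expand.grid(.mtry = 1:4),
               trControl = trainControl(method = "cv", number = 5))

  # outer estimate: predict on data the tuning never saw
  preds        <- predict(fit, newdata = test_dat)
  outer_acc[k] <- mean(preds == test_dat$Species)
}

mean(outer_acc)  # performance estimate of the tuned model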
In most cases Accuracy is not a suitable metric for choosing the best classification model, especially with imbalanced data sets. Read up on the area under the ROC curve (AUC), Cohen's kappa, Matthews correlation coefficient, balanced accuracy, the F1 score, and classification-threshold tuning.
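Several of these are available directly from caret's confusionMatrix (a sketch assuming factor vectors pred and obs holding predicted and observed classes with the same levels):

cm <- confusionMatrix(data = pred, reference = obs)
cm$overall["Kappa"]                                  # Cohen's kappa
cm$byClass[c("Sensitivity", "Specificity",
             "Balanced Accuracy", "F1")]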
Here is an example of how to tune the rf parameters jointly. I will use the Sonar data set from the mlbench package.
create predefined folds:
library(caret)
library(mlbench)
data(Sonar)
set.seed(1234)
cv_folds <- createFolds(Sonar$Class, k = 5, returnTrain = TRUE)
create tune control:
tuneGrid <- expand.grid(.mtry = c(1 : 10))
ctrl <- trainControl(method = "cv",
                     number = 5,
                     search = 'grid',
                     classProbs = TRUE,
                     savePredictions = "final",
                     index = cv_folds,
                     summaryFunction = twoClassSummary) # in most cases a better summary for two-class problems
define other parameters to tune. I will use just a few combinations to limit the training time of the example:
ntrees <- c(500, 1000)
nodesize <- c(1, 5)
params <- expand.grid(ntrees = ntrees,
                      nodesize = nodesize)
train:
store_maxnode <- vector("list", nrow(params))
for (i in 1:nrow(params)) {
  nodesize <- params[i, 2]
  ntree    <- params[i, 1]
  set.seed(65)
  rf_model <- train(Class ~ .,
                    data = Sonar,
                    method = "rf",
                    importance = TRUE,
                    metric = "ROC",
                    tuneGrid = tuneGrid,
                    trControl = ctrl,
                    ntree = ntree,
                    nodesize = nodesize)
  store_maxnode[[i]] <- rf_model
}
To avoid generic model names (model1, model2, ...), we can name the elements of the resulting list after the corresponding parameters:
names(store_maxnode) <- paste("ntrees:", params$ntrees,
                              "nodesize:", params$nodesize)
combine results:
results_mtry <- resamples(store_maxnode)
summary(results_mtry)
output:
Call:
summary.resamples(object = results_mtry)
Models: ntrees: 500 nodesize: 1, ntrees: 1000 nodesize: 1, ntrees: 500 nodesize: 5, ntrees: 1000 nodesize: 5
Number of resamples: 5
ROC
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
ntrees: 500 nodesize: 1 0.9108696 0.9354067 0.9449761 0.9465758 0.9688995 0.9727273 0
ntrees: 1000 nodesize: 1 0.8847826 0.9473684 0.9569378 0.9474828 0.9665072 0.9818182 0
ntrees: 500 nodesize: 5 0.9163043 0.9377990 0.9569378 0.9481652 0.9593301 0.9704545 0
ntrees: 1000 nodesize: 5 0.9000000 0.9342105 0.9521531 0.9462321 0.9641148 0.9806818 0
Sens
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
ntrees: 500 nodesize: 1 0.9090909 0.9545455 0.9545455 0.9549407 0.9565217 1.0000000 0
ntrees: 1000 nodesize: 1 0.9090909 0.9130435 0.9545455 0.9371542 0.9545455 0.9545455 0
ntrees: 500 nodesize: 5 0.9090909 0.9545455 0.9545455 0.9458498 0.9545455 0.9565217 0
ntrees: 1000 nodesize: 5 0.9090909 0.9545455 0.9545455 0.9458498 0.9545455 0.9565217 0
Spec
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
ntrees: 500 nodesize: 1 0.65 0.6842105 0.7368421 0.7421053 0.7894737 0.8500000 0
ntrees: 1000 nodesize: 1 0.60 0.6842105 0.7894737 0.7631579 0.8421053 0.9000000 0
ntrees: 500 nodesize: 5 0.55 0.6842105 0.7894737 0.7331579 0.8000000 0.8421053 0
ntrees: 1000 nodesize: 5 0.60 0.6842105 0.7368421 0.7321053 0.7894737 0.8500000 0
To get the best mtry for each model:
lapply(store_maxnode, function(x) x$best)
#output
$`ntrees: 500 nodesize: 1`
mtry
1 1
$`ntrees: 1000 nodesize: 1`
mtry
2 2
$`ntrees: 500 nodesize: 5`
mtry
1 1
$`ntrees: 1000 nodesize: 5`
mtry
1 1
Alternatively, to get the best average performance for each model:
lapply(store_maxnode, function(x) x$results[x$results$ROC == max(x$results$ROC),])
#output
$`ntrees: 500 nodesize: 1`
mtry ROC Sens Spec ROCSD SensSD SpecSD
1 1 0.9465758 0.9549407 0.7421053 0.02541895 0.03215337 0.0802308
$`ntrees: 1000 nodesize: 1`
mtry ROC Sens Spec ROCSD SensSD SpecSD
2 2 0.9474828 0.9371542 0.7631579 0.03728797 0.02385499 0.1209382
$`ntrees: 500 nodesize: 5`
mtry ROC Sens Spec ROCSD SensSD SpecSD
1 1 0.9481652 0.9458498 0.7331579 0.02133659 0.02056666 0.1177407
$`ntrees: 1000 nodesize: 5`
mtry ROC Sens Spec ROCSD SensSD SpecSD
1 1 0.9462321 0.9458498 0.7321053 0.03091747 0.02056666 0.0961229
From this toy example you can see that the highest average (over the 5 folds) area under the ROC curve is achieved with ntrees: 500, nodesize: 5, and mtry: 1, and that it is equal to 0.948.
Alternatively, you can use the default summary function:
ctrl <- trainControl(method = "cv",
                     number = 5,
                     search = 'grid',
                     classProbs = TRUE,
                     savePredictions = "final",
                     index = cv_folds)
and set metric = "Kappa" in train.
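For example, a sketch reusing the objects defined above and picking one parameter combination:

set.seed(65)
rf_kappa <- train(Class ~ .,
                  data = Sonar,
                  method = "rf",
                  metric = "Kappa",   # select mtry by Cohen's kappa
                  tuneGrid = tuneGrid,
                  trControl = ctrl,   # the default-summary control above
                  ntree = 500,
                  nodesize = 5)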
Upvotes: 17