Jack Armstrong

Reputation: 1249

Building a RandomForest with caret

I was attempting to build a RandomForest model in caret following the steps here. Essentially, they set up the RandomForest, then find the best mtry, then the best maxnodes, then the best number of trees. These steps make sense, but wouldn't it be better to search over the interaction of those three factors rather than tuning them one at a time?

Secondly, I understand performing a grid search for mtry and ntrees, but I do not know what to set the minimum or maximum number of nodes to. Is it generally advisable to leave nodesize at its default, as shown below?

library(randomForest)
library(caret)

mtrys <- seq(1, 4, 1)
ntrees <- c(250, 300, 350, 400, 450, 500, 550, 600, 800, 1000, 2000)
combo_mtrTrees <- data.frame(expand.grid(mtrys, ntrees))
colnames(combo_mtrTrees) <- c('mtrys', 'ntrees')

tuneGrid <- expand.grid(.mtry = c(1:4))

for (i in seq_along(ntrees)) {
  ntree <- ntrees[i]
  set.seed(65)
  rf_maxtrees <- train(Species ~ .,
                       data = df,
                       method = "rf",
                       importance = TRUE,
                       metric = "Accuracy",
                       tuneGrid = tuneGrid,
                       trControl = trainControl(method = "cv",
                                                number = 5,
                                                search = 'grid',
                                                classProbs = TRUE,
                                                savePredictions = "final"),
                       ntree = ntree)
  Acc1 <- rf_maxtrees$results$Accuracy[rf_maxtrees$results$mtry == 1]
  Acc2 <- rf_maxtrees$results$Accuracy[rf_maxtrees$results$mtry == 2]
  Acc3 <- rf_maxtrees$results$Accuracy[rf_maxtrees$results$mtry == 3]
  Acc4 <- rf_maxtrees$results$Accuracy[rf_maxtrees$results$mtry == 4]
  combo_mtrTrees$Acc[combo_mtrTrees$mtrys == 1 & combo_mtrTrees$ntrees == ntree] <- Acc1
  combo_mtrTrees$Acc[combo_mtrTrees$mtrys == 2 & combo_mtrTrees$ntrees == ntree] <- Acc2
  combo_mtrTrees$Acc[combo_mtrTrees$mtrys == 3 & combo_mtrTrees$ntrees == ntree] <- Acc3
  combo_mtrTrees$Acc[combo_mtrTrees$mtrys == 4 & combo_mtrTrees$ntrees == ntree] <- Acc4
}

Upvotes: 7

Views: 19698

Answers (1)

missuse

Reputation: 19716

  1. Yes, it would be better to search over the interactions of parameters.

  2. nodesize and maxnodes are usually left at their defaults, but there is no reason not to tune them. Personally, I would leave maxnodes at the default and perhaps tune nodesize - it can be seen as a regularization parameter. To get an idea of what values to try, check the defaults in randomForest: nodesize is 1 for classification and 5 for regression, so trying 1-10 would be an option (see the note after the parameter grid below).

  3. When performing tuning in a loop like in your example, it is advisable to always use the same cross-validation folds. You can create them with createFolds prior to calling the loop.

  4. After tuning, be sure to evaluate your results on an independent validation set, or perform nested cross-validation where the inner loop is used to tune the parameters and the outer loop to estimate model performance, since results from cross-validation alone will be optimistically biased (a minimal sketch of the outer loop is given after the tuning example below).

  5. In most cases Accuracy is not a suitable metric to choose the best classification model, especially with imbalanced data sets. Read up on the area under the ROC curve, Cohen's kappa, Matthews correlation coefficient, balanced accuracy, the F1 score, and classification threshold tuning.

  6. Here is an example of how to tune the rf parameters jointly. I will use the Sonar data set from the mlbench package.

create predefined folds:

library(caret) 
library(mlbench)
data(Sonar)

set.seed(1234)
cv_folds <- createFolds(Sonar$Class, k = 5, returnTrain = TRUE)

create tune control:

tuneGrid <- expand.grid(.mtry = c(1 : 10))

ctrl <- trainControl(method = "cv",
                     number = 5,
                     search = 'grid',
                     classProbs = TRUE,
                     savePredictions = "final",
                     index = cv_folds,
                     summaryFunction = twoClassSummary) # in most cases a better summary for two-class problems

define other parameters to tune. I will use just a few combinations to limit the training time of the example:

ntrees <- c(500, 1000)    
nodesize <- c(1, 5)

params <- expand.grid(ntrees = ntrees,
                      nodesize = nodesize)
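
If you wanted to search nodesize more broadly, as suggested in point 2, you would only need to widen this grid. A purely illustrative alternative (not used in the rest of the example, since it multiplies the number of models to fit):

# alternative, wider grid - not used below
nodesize <- 1:10   # values around the randomForest defaults of 1 and 5

params <- expand.grid(ntrees = ntrees,
                      nodesize = nodesize)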

train:

store_maxnode <- vector("list", nrow(params))
for (i in 1:nrow(params)) {
  nodesize <- params[i, 2]
  ntree <- params[i, 1]
  set.seed(65)
  rf_model <- train(Class ~ .,
                    data = Sonar,
                    method = "rf",
                    importance = TRUE,
                    metric = "ROC",
                    tuneGrid = tuneGrid,
                    trControl = ctrl,
                    ntree = ntree,
                    nodesize = nodesize)
  store_maxnode[[i]] <- rf_model
}

EDIT 26.02.2021:

To avoid generic model names (model1, model2, ...), we can name the resulting list with the corresponding parameters:

names(store_maxnode) <- paste("ntrees:", params$ntrees,
                              "nodesize:", params$nodesize)


combine results:

results_mtry <- resamples(store_maxnode)

summary(results_mtry)

output:

Call:
summary.resamples(object = results_mtry)

Models: ntrees: 500 nodesize: 1, ntrees: 1000 nodesize: 1, ntrees: 500 nodesize: 5, ntrees: 1000 nodesize: 5 
Number of resamples: 5 

ROC 
                              Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
ntrees: 500 nodesize: 1  0.9108696 0.9354067 0.9449761 0.9465758 0.9688995 0.9727273    0
ntrees: 1000 nodesize: 1 0.8847826 0.9473684 0.9569378 0.9474828 0.9665072 0.9818182    0
ntrees: 500 nodesize: 5  0.9163043 0.9377990 0.9569378 0.9481652 0.9593301 0.9704545    0
ntrees: 1000 nodesize: 5 0.9000000 0.9342105 0.9521531 0.9462321 0.9641148 0.9806818    0

Sens 
                              Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
ntrees: 500 nodesize: 1  0.9090909 0.9545455 0.9545455 0.9549407 0.9565217 1.0000000    0
ntrees: 1000 nodesize: 1 0.9090909 0.9130435 0.9545455 0.9371542 0.9545455 0.9545455    0
ntrees: 500 nodesize: 5  0.9090909 0.9545455 0.9545455 0.9458498 0.9545455 0.9565217    0
ntrees: 1000 nodesize: 5 0.9090909 0.9545455 0.9545455 0.9458498 0.9545455 0.9565217    0

Spec 
                         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
ntrees: 500 nodesize: 1  0.65 0.6842105 0.7368421 0.7421053 0.7894737 0.8500000    0
ntrees: 1000 nodesize: 1 0.60 0.6842105 0.7894737 0.7631579 0.8421053 0.9000000    0
ntrees: 500 nodesize: 5  0.55 0.6842105 0.7894737 0.7331579 0.8000000 0.8421053    0
ntrees: 1000 nodesize: 5 0.60 0.6842105 0.7368421 0.7321053 0.7894737 0.8500000    0

To get the best mtry for each model:

lapply(store_maxnode, function(x) x$bestTune)
#output
$`ntrees: 500 nodesize: 1`
  mtry
1    1

$`ntrees: 1000 nodesize: 1`
  mtry
2    2

$`ntrees: 500 nodesize: 5`
  mtry
1    1

$`ntrees: 1000 nodesize: 5`
  mtry
1    1

EDIT 26.02.2021: Alternatively, to get the best average performance for each model:

lapply(store_maxnode, function(x) x$results[x$results$ROC == max(x$results$ROC),])
#output
$`ntrees: 500 nodesize: 1`
  mtry       ROC      Sens      Spec      ROCSD     SensSD    SpecSD
1    1 0.9465758 0.9549407 0.7421053 0.02541895 0.03215337 0.0802308

$`ntrees: 1000 nodesize: 1`
  mtry       ROC      Sens      Spec      ROCSD     SensSD    SpecSD
2    2 0.9474828 0.9371542 0.7631579 0.03728797 0.02385499 0.1209382

$`ntrees: 500 nodesize: 5`
  mtry       ROC      Sens      Spec      ROCSD     SensSD    SpecSD
1    1 0.9481652 0.9458498 0.7331579 0.02133659 0.02056666 0.1177407

$`ntrees: 1000 nodesize: 5`
  mtry       ROC      Sens      Spec      ROCSD     SensSD    SpecSD
1    1 0.9462321 0.9458498 0.7321053 0.03091747 0.02056666 0.0961229

From this toy example you can see that the highest average (over the 5 folds) area under the ROC curve (ROC) is achieved with ntrees: 500, nodesize: 5 and mtry: 1, and that it is equal to 0.948.
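
Keep in mind (point 4) that this 0.948 is an optimistically biased estimate, because the same folds were used both to pick the best combination and to report its performance. Below is a minimal sketch of how the outer loop of a nested CV could look; the object names (outer_folds, outer_auc) are just for illustration, the inner CV here tunes only mtry, and in a full nested CV the ntrees/nodesize loop above would also sit inside the outer loop:

set.seed(4321)
outer_folds <- createFolds(Sonar$Class, k = 5, returnTrain = TRUE)

outer_auc <- sapply(outer_folds, function(idx) {
  train_dat <- Sonar[idx, ]    # outer training part: used only for tuning
  test_dat  <- Sonar[-idx, ]   # outer test part: never seen while tuning

  inner_ctrl <- trainControl(method = "cv",
                             number = 5,
                             classProbs = TRUE,
                             summaryFunction = twoClassSummary)

  fit <- train(Class ~ .,
               data = train_dat,
               method = "rf",
               metric = "ROC",
               tuneGrid = tuneGrid,    # inner CV tunes mtry on the outer training part
               trControl = inner_ctrl)

  # AUC of the tuned model on the held-out outer fold (uses the pROC package)
  probs <- predict(fit, test_dat, type = "prob")[, "M"]
  as.numeric(pROC::auc(test_dat$Class, probs))
})

mean(outer_auc)   # less biased estimate of the tuned model's performance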

Alternatively, you can use the default summary:

ctrl <- trainControl(method = "cv",
                     number = 5,
                     search = 'grid',
                     classProbs = TRUE,
                     savePredictions = "final",
                     index = cv_folds)

and define metric = "Kappa" in train.
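
For instance, a minimal sketch reusing ctrl, tuneGrid and the Sonar data from above (ntree and nodesize could again be looped over exactly as before):

set.seed(65)
rf_kappa <- train(Class ~ .,
                  data = Sonar,
                  method = "rf",
                  importance = TRUE,
                  metric = "Kappa",   # pick mtry by Cohen's kappa instead of ROC
                  tuneGrid = tuneGrid,
                  trControl = ctrl)

rf_kappa$results[, c("mtry", "Accuracy", "Kappa")]   # the default summary reports Accuracy and Kappa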

Upvotes: 17
