Reputation: 299
I ran cross-validation on my data using the random forest method in the caret package. R reports that the final model was built with mtry = 34. Does this mean that the final random forest (the one resulting from cross-validation) used only 34 of the variables in my data set for splitting in its trees?
> output
Random Forest
375 samples
592 predictors
2 classes: 'alzheimer', 'control'
No pre-processing
Resampling: Cross-Validated (3 fold)
Summary of sample sizes: 250, 250, 250
Resampling results across tuning parameters:
  mtry  Accuracy   Kappa
     2  0.6826667  0.3565541
    34  0.7600000  0.5194246
   591  0.7173333  0.4343563
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 34.
Upvotes: 3
Views: 3163
Reputation: 9583
Since you've built your random forest using the caret package, a tip is to use $finalModel to obtain the summary of your final model, which is the model selected according to a predefined metric (cross-validated Accuracy, in your output).
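As a minimal sketch of that tip, the snippet below trains a small forest and inspects `$finalModel`. It assumes the caret and randomForest packages are installed, and it uses the built-in iris data purely as a stand-in for your data; the seed and `tuneLength` are arbitrary choices of mine.

```r
# Assumes caret and randomForest are installed; iris is only a stand-in data set.
library(caret)

set.seed(1)
fit <- train(
  x = iris[, 1:4],
  y = iris$Species,
  method = "rf",
  tuneLength = 3,                                   # try 3 candidate mtry values
  trControl = trainControl(method = "cv", number = 3)  # 3-fold cross-validation
)

fit$bestTune           # the mtry value selected by cross-validation
fit$finalModel$mtry    # the same value, as stored in the refit randomForest object
print(fit$finalModel)  # summary includes "No. of variables tried at each split"
```

The value stored in `fit$finalModel$mtry` should match `fit$bestTune$mtry`, because caret refits the forest on the full training set with the selected tuning value.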
Now to answer your question:
In the summary of the final model (my example shows 31 rather than 34, but you get the point), you can see that the random forest randomly chooses from 34 candidate variables at each split. This is not to be confused with using only 34 variables to grow each tree, as your question suggests. In fact, in a sufficiently large random forest all variables can end up being used; it is only that, at each node, one variable is picked from a random pool of 34, which reduces the variance of the model. This makes the trees more independent of one another and, consequently, makes the gains from averaging over a large number of trees more significant.
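A quick base-R simulation may help illustrate the difference. It is only a sketch of the sampling step, not the randomForest internals: the number of splits is a hypothetical figure I picked, and it only shows which predictors get offered as candidates, not which ones win the split.

```r
# Illustrative simulation only; this is not how randomForest is implemented internally.
set.seed(42)
M <- 592          # number of predictors, as in the question
m <- 34           # mtry selected by caret
n_splits <- 2000  # hypothetical number of node splits across the whole forest

# At every split, draw m candidate predictor indices without replacement.
candidates <- replicate(n_splits, sample.int(M, m))  # m x n_splits matrix

# Every split considers exactly m candidates ...
candidates_per_split <- nrow(candidates)

# ... yet across many splits, the predictors offered at least once cover far more than m.
n_seen <- length(unique(as.vector(candidates)))

cat("candidates per split:", candidates_per_split, "\n")
cat("distinct predictors offered at least once:", n_seen, "of", M, "\n")
```

Each individual split only ever looks at 34 predictors, but because a fresh set is drawn every time, the forest as a whole is not restricted to any fixed subset of 34.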
The tree-growing process for each tree is as follows (assuming you're using the randomForest implementation, either through caret or from randomForest directly):
m (smaller than M) is specified such that, at each node split, m variables are selected at random out of the M, and the best candidate out of the m (measured by the reduction in node impurity, which is the Gini index for classification in randomForest) is used to split the node. m is held constant while the forest is grown.
Sorry for the two-month-late answer, but I thought this was a great question and that it would be a shame if it didn't get a more elaborate explanation of what the mtry parameter truly does. It's quite often misunderstood, so I thought I would add an answer here!
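To make the per-node procedure concrete, here is a simplified base-R sketch of a single node split. The helper names (`gini`, `best_threshold`, `split_node`) are hypothetical ones I made up for illustration, iris stands in for real data, and the real randomForest code is considerably more involved.

```r
# Sketch of one node split; simplified relative to the real randomForest code.

# Gini impurity of a vector of class labels.
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Best threshold on one numeric predictor, scored by decrease in Gini impurity.
best_threshold <- function(x, y) {
  vals <- sort(unique(x))
  if (length(vals) < 2) return(list(gain = 0, threshold = NA))
  cuts <- (head(vals, -1) + tail(vals, -1)) / 2  # midpoints between distinct values
  gains <- sapply(cuts, function(cut) {
    left <- y[x <= cut]
    right <- y[x > cut]
    gini(y) - (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  })
  list(gain = max(gains), threshold = cuts[which.max(gains)])
}

# Pick m of the M predictors at random, then split on the best of those m only.
split_node <- function(X, y, m) {
  cand <- sample(ncol(X), m)
  res <- lapply(cand, function(j) best_threshold(X[[j]], y))
  best <- which.max(sapply(res, function(r) r$gain))
  list(candidates = names(X)[cand],
       variable = names(X)[cand[best]],
       threshold = res[[best]]$threshold)
}

set.seed(7)
node <- split_node(iris[, 1:4], iris$Species, m = 2)
node$candidates  # the 2 predictors considered at this node
node$variable    # the one actually used to split
```

Calling `split_node` repeatedly draws a new candidate set each time, which is the role mtry plays at every node of every tree.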
Upvotes: 6
Reputation: 23598
Documentation of randomForest:
mtry: Number of variables randomly sampled as candidates at each split.
In this case, the final model considers 34 randomly sampled candidate variables at each split in a tree.
Upvotes: 1