Xi Liang

Reputation: 1301

How is the number of nodes determined in a random forest in R?

I use the randomForest package to perform binary classification. How does randomForest() determine the number of nodes in each tree? I think the number of nodes is saved in model$forest$nrnodes. Am I correct here?

In my dataset, I have 10,000 positive and 70,000 negative samples. I built several models with default parameters except for the number of trees (50, 100, 200, and 500). Their performance is quite similar. The number of nodes in each model is also quite similar, around 1400.

Could someone explain how this 1400 is computed? Which parameter controls the number of nodes in each tree? Any advice will be much appreciated!

Upvotes: 2

Views: 4765

Answers (1)

Willis

Reputation: 11

randomForest(x, y=NULL,  xtest=NULL, ytest=NULL, ntree=500,
         mtry=if (!is.null(y) && !is.factor(y))
         max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
         replace=TRUE, classwt=NULL, cutoff, strata,
         sampsize = if (replace) nrow(x) else ceiling(.632*nrow(x)),
         nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1,
         maxnodes = NULL,
         importance=FALSE, localImp=FALSE, nPerm=1,
         proximity, oob.prox=proximity,
         norm.votes=TRUE, do.trace=FALSE,
         keep.forest=!is.null(y) && is.null(xtest), corr.bias=FALSE,
         keep.inbag=FALSE, ...)

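The relevant argument is nodesize, the minimum size of terminal nodes. The condition in its default, !is.null(y) && !is.factor(y), is TRUE only when y is supplied and is not a factor, that is, for regression, which gives nodesize=5. For classification y is a factor, so the condition is FALSE and nodesize=1. Because maxnodes=NULL places no cap on tree size, each tree keeps splitting on your predictor variables until its terminal nodes are pure or cannot be split further, regardless of the number of trees. The node counts will differ slightly between trees and between models because of the randomness in building each tree (the bootstrap sample and the mtry variables tried at each split).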
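To see this on a small example, here is a minimal sketch using the built-in iris data reduced to two classes (the data, seed, and the nodesize/maxnodes values are illustrative assumptions, not taken from the question). It assumes the fitted object stores the per-tree node count in forest$ndbigtree, as current versions of the package appear to do; forest$nrnodes seems to hold only the largest of those counts, so it is a single number per model rather than one per tree.

```r
library(randomForest)

set.seed(42)  # make the bootstrap samples and mtry draws reproducible

# Two-class subset of iris, used purely for illustration
df <- droplevels(subset(iris, Species != "setosa"))

# Classification defaults: nodesize = 1, maxnodes = NULL (trees grown fully)
rf_default <- randomForest(Species ~ ., data = df, ntree = 50)

# Larger terminal nodes plus a cap on terminal nodes should give smaller trees
rf_small <- randomForest(Species ~ ., data = df, ntree = 50,
                         nodesize = 10, maxnodes = 4)

# Total number of nodes in each tree (one entry per tree)
nodes_default <- rf_default$forest$ndbigtree
nodes_small   <- rf_small$forest$ndbigtree

# Number of terminal nodes in each tree
leaves_small <- treesize(rf_small, terminal = TRUE)

print(summary(nodes_default))
print(summary(nodes_small))
print(rf_default$forest$nrnodes)  # compare with max(nodes_default)
```

Comparing the two summaries should show how nodesize and maxnodes limit tree growth, while changing ntree alone only changes how many trees there are, not how large each one is.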

Upvotes: 1
