Reputation: 40628
Does the randomForest package ignore the nodesize parameter? When I predict the terminal nodes for a dataset and check the counts, I see values that are less than the nodesize. I would submit a fix for this myself, but the underlying code was written in Fortran. If someone can confirm this behavior, I will reach out to the package maintainer and hopefully start a fix.
> library(randomForest)
> set.seed(1)
> rf <- randomForest(mtcars[,-1], mtcars[,1], nodesize = 5)
> nodes <- attr(predict(rf, mtcars[,-1], nodes = TRUE), 'nodes')
# node counts of first tree
> table(nodes[,1])
# first row is the terminal node ID#, second row is the count
 2  6  9 10 11 14 15 16 18 19
 5  3  3  6  4  2  3  1  3  2
Adding system info:
Session info ----------------------------------------------------------------
 setting  value
 version  R version 3.1.1 (2014-07-10)
 system   x86_64, mingw32
 ui       RStudio (0.98.1049)
 language (EN)
 collate  English_United States.1252
 tz       America/Chicago

Packages --------------------------------------------------------------------
 package      * version date       source
 randomForest * 4.6.10  2014-07-17 CRAN (R 3.1.1)
Upvotes: 4
Views: 729
Reputation: 40628
Response from package maintainer:
That parameter behaves the way that Leo Breiman intended. The bug is in how the parameter was described. It’s the same as minsplit in the rpart:::rpart.control() function: the minimum number of observations that must exist in a node in order for a split to be attempted.
I will change the description in the help file in the next version to resolve this confusion.
Best, Andy
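For anyone who wants to see the minsplit semantics in isolation, here is a minimal sketch using rpart on the same mtcars data. The control values (minsplit = 5, minbucket = 1, cp = 0, xval = 0) are my own choices for illustration and do not come from the maintainer's reply. It shows how rpart treats minsplit, which is only an analogy for nodesize and not a look at the randomForest internals.

```r
library(rpart)

# minsplit = 5: a node must hold at least 5 observations before a split is attempted.
# minbucket = 1: the leaves produced by a split may hold as few as 1 observation.
# cp = 0 and xval = 0 switch off complexity pruning and cross-validation.
fit <- rpart(mpg ~ ., data = mtcars,
             control = rpart.control(minsplit = 5, minbucket = 1,
                                     cp = 0, xval = 0))

# fit$where maps each observation to the row of fit$frame for its leaf
leaf_sizes <- table(fit$where)
print(leaf_sizes)

# Leaves may be smaller than minsplit ...
print(any(leaf_sizes < 5))

# ... because the threshold applies to the nodes that were split, not to the leaves
split_nodes <- fit$frame[fit$frame$var != "<leaf>", ]
print(all(split_nodes$n >= 5))
```

So the threshold limits which nodes are eligible for splitting, not how small the resulting terminal nodes can be. That matches the counts below 5 in the table from the question.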
Upvotes: 1