Reputation: 4640
I have two questions related to randomForest in R.
How can I find the best values for the two arguments ntree and nodesize? At the moment I just pick numbers more or less at random, and sometimes one choice happens to give a better result. Can I use some kind of k-fold cross-validation, or if not, what method can I use to find these values?
After running the randomForest function and getting a model, I make predictions, and from the predicted data I can build a confusion table like the one below:
          Predicted
Actual     1   2   3
     1     4   3   1
     2     2   4   2
     3     3   2   1
(i.e., there are 4 + 4 + 1 = 9 correct predictions)
My question is: given this kind of table, how can I calculate the RMSE (root mean square error) of the prediction? Of course I could do it by hand, but I suspect that is not the best approach.
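For concreteness, doing it manually would look something like this (a sketch, assuming the class labels 1, 2, 3 are truly numeric, so that differences between them are meaningful; cm is the table above, hand-entered):

# confusion matrix from above: rows = actual, columns = predicted
cm <- matrix(c(4, 3, 1,
               2, 4, 2,
               3, 2, 1),
             nrow = 3, byrow = TRUE,
             dimnames = list(actual = 1:3, predicted = 1:3))
# squared label difference for every cell of the table
sq_err <- (row(cm) - col(cm))^2
# weight each squared error by its cell count, average, take the square root
sqrt(sum(cm * sq_err) / sum(cm))
# [1] 1.066004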
Thank you very much,
Upvotes: 0
Views: 10023
Reputation: 109232
You can do all of the above with the mlr package. The tutorial has detailed sections on tuning and performance measurements. For tuning, you should use nested resampling.
Assuming that you have a regression task, it would look something like this:
library(mlr)
# define the parameters we want to tune -- you may want to adjust the bounds
ps = makeParamSet(
  makeIntegerLearnerParam(id = "ntree", default = 500L, lower = 1L, upper = 1000L),
  makeIntegerLearnerParam(id = "nodesize", default = 1L, lower = 1L, upper = 50L)
)
# random sampling of the configuration space, with at most 100 samples
ctrl = makeTuneControlRandom(maxit = 100L)
# inner resampling loop: 3-fold cross-validation for the tuning itself
inner = makeResampleDesc("CV", iters = 3L)
learner = makeTuneWrapper("regr.randomForest", resampling = inner, par.set = ps,
                          control = ctrl, show.info = FALSE, measures = rmse)
# outer resampling loop: 3-fold cross-validation to assess the tuned model
outer = makeResampleDesc("CV", iters = 3L)
# run the nested resampling, using the example Boston housing task
res = resample(learner, bh.task, resampling = outer, extract = getTuneResult)
# show performance
print(performance(res$pred, measures = rmse))
The whole process looks very similar for classification; see the relevant tutorial pages for more details.
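The changes essentially amount to swapping the learner and the measure; a sketch, using mlr's built-in iris.task and accuracy as the measure (ps, ctrl, inner, and outer as above):

learner = makeTuneWrapper("classif.randomForest", resampling = inner, par.set = ps,
                          control = ctrl, show.info = FALSE, measures = acc)
res = resample(learner, iris.task, resampling = outer, extract = getTuneResult)
print(performance(res$pred, measures = acc))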
Upvotes: 3
Reputation: 3121
Yes, you can select the best parameters via k-fold cross-validation. I would recommend not tuning ntree, and instead just setting it relatively high (1500-2000 trees), as overfitting is not a concern with random forests; that way you have one less parameter to tune. You can still go ahead and tune mtry, for example as sketched below.
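A minimal sketch of cross-validated mtry tuning with the caret package (the data frame train_df and its factor response column y are hypothetical placeholders for your own data):

library(caret)
library(randomForest)

# 5-fold cross-validation, tuning only mtry; ntree is fixed high as suggested
ctrl <- trainControl(method = "cv", number = 5)
grid <- expand.grid(mtry = 1:5)
fit <- train(y ~ ., data = train_df, method = "rf",
             trControl = ctrl, tuneGrid = grid, ntree = 1500)
fit$bestTune   # the mtry value that performed best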
There are many different measures for assessing the performance of a classification problem. If you are specifically interested in an RMSE-like measure, you could check out this CV post, which discusses the Brier score: it is calculated like an RMSE, using the forecast probability and the actual outcome to get a mean squared error, as in the short sketch below.
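For a binary problem the calculation is just (a sketch with made-up numbers; p is the predicted probability of the positive class, y the actual outcome coded 0/1):

p <- c(0.9, 0.3, 0.6, 0.1)   # hypothetical predicted probabilities
y <- c(1, 0, 1, 0)           # actual outcomes
mean((p - y)^2)              # Brier score
sqrt(mean((p - y)^2))        # its square root, directly comparable to an RMSE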
Upvotes: 3