Reputation: 4640
I have two questions related to randomForest in R.
How can I find the best values for the two arguments ntree and nodesize? At the moment I just pick numbers more or less at random, and sometimes one choice happens to give a better result. Can I use some kind of k-fold cross-validation, or if not, what method can I use to find these values?
After running the randomForest function and getting a model, I make predictions, and from the predicted data I can build a confusion table like the one below:
          Predicted
Actual     1   2   3
     1     4   3   1
     2     2   4   2
     3     3   2   1
(i.e., there are 4 + 4 + 1 = 9 correct predictions)
My question is: given this kind of table, how can I calculate the RMSE (root mean square error) of the prediction? Of course I could do it by hand, but I suspect that is not the best approach.
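For concreteness, doing it manually would look something like this (a sketch, assuming the class labels 1, 2, 3 are truly numeric, so that differences between them are meaningful; cm is the table above, hand-entered):

# confusion matrix from above: rows = actual, columns = predicted
cm <- matrix(c(4, 3, 1,
               2, 4, 2,
               3, 2, 1),
             nrow = 3, byrow = TRUE,
             dimnames = list(actual = 1:3, predicted = 1:3))
# squared label difference for every cell of the table
sq_err <- (row(cm) - col(cm))^2
# weight each squared error by its cell count, average, take the square root
sqrt(sum(cm * sq_err) / sum(cm))
# [1] 1.066004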
Thank you very much,
Upvotes: 0
Views: 10023
Reputation: 109232
You can do all of the above with the mlr package. The tutorial has detailed sections on tuning and performance measurements. For tuning, you should use nested resampling.
Assuming that you have a regression task, it would look something like this:
library(mlr)
# define the parameters we want to tune -- you may want to adjust the bounds
ps = makeParamSet(
  makeIntegerLearnerParam(id = "ntree", default = 500L, lower = 1L, upper = 1000L),
  makeIntegerLearnerParam(id = "nodesize", default = 1L, lower = 1L, upper = 50L)
)
# random sampling of the configuration space, with at most 100 samples
ctrl = makeTuneControlRandom(maxit = 100L)
# inner resampling loop: 3-fold cross-validation for the tuning itself
inner = makeResampleDesc("CV", iters = 3L)
learner = makeTuneWrapper("regr.randomForest", resampling = inner, par.set = ps,
                          control = ctrl, show.info = FALSE, measures = rmse)
# outer resampling loop: 3-fold cross-validation to assess the tuned model
outer = makeResampleDesc("CV", iters = 3L)
# run the nested resampling, using the example Boston housing task
res = resample(learner, bh.task, resampling = outer, extract = getTuneResult)
# show performance
print(performance(res$pred, measures = rmse))
The whole process looks very similar for classification; see the relevant tutorial pages for more details.
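The changes essentially amount to swapping the learner and the measure; a sketch, using mlr's built-in iris.task and accuracy as the measure (ps, ctrl, inner, and outer as above):

learner = makeTuneWrapper("classif.randomForest", resampling = inner, par.set = ps,
                          control = ctrl, show.info = FALSE, measures = acc)
res = resample(learner, iris.task, resampling = outer, extract = getTuneResult)
print(performance(res$pred, measures = acc))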
Upvotes: 3
Reputation: 3121
Yes, you can select the best parameters via k-fold cross-validation. I would recommend not tuning ntree, and instead just setting it relatively high (1500-2000 trees), as overfitting is not a concern with random forests; that way you have one less parameter to tune. You can still go ahead and tune mtry, for example as sketched below.
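A minimal sketch of cross-validated mtry tuning with the caret package (the data frame train_df and its factor response column y are hypothetical placeholders for your own data):

library(caret)
library(randomForest)

# 5-fold cross-validation, tuning only mtry; ntree is fixed high as suggested
ctrl <- trainControl(method = "cv", number = 5)
grid <- expand.grid(mtry = 1:5)
fit <- train(y ~ ., data = train_df, method = "rf",
             trControl = ctrl, tuneGrid = grid, ntree = 1500)
fit$bestTune   # the mtry value that performed best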
There are many different measures for assessing the performance of a classification problem. If you are specifically interested in an RMSE-like measure, you could check out this CV post, which discusses the Brier score: it is calculated like an RMSE, using the forecast probability and the actual outcome to get a mean squared error, as in the short sketch below.
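For a binary problem the calculation is just (a sketch with made-up numbers; p is the predicted probability of the positive class, y the actual outcome coded 0/1):

p <- c(0.9, 0.3, 0.6, 0.1)   # hypothetical predicted probabilities
y <- c(1, 0, 1, 0)           # actual outcomes
mean((p - y)^2)              # Brier score
sqrt(mean((p - y)^2))        # its square root, directly comparable to an RMSE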
Upvotes: 3