user2763361

Reputation: 3919

Massive datasets with the randomForest package

I have about 300,000 rows of data and 10 features in my model and I want to fit a random forest from the randomForest package in R.

To maximise the number of trees I can grow in the forest in a fixed window of time without ruining generalisation, what are sensible ranges for the parameters?

Upvotes: 0

Views: 299

Answers (1)

Stephen Henderson

Reputation: 6522

Usually you can get away with tuning just mtry, as explained here, and the default is often best:

https://stats.stackexchange.com/questions/50210/caret-and-randomforest-number-of-trees

But there is a function, tuneRF, in the randomForest package that will help you find an optimal mtry for a given number of trees, as explained here:

setting values for ntree and mtry for random forest regression model
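A minimal sketch of a tuneRF call, assuming a regression problem and using synthetic data as a stand-in for the real 300,000-row set (the data, the seed, and the argument values are illustrative assumptions, not recommendations). Note that tuneRF searches over mtry only; ntreeTry just sets how many trees are grown for each candidate.

```r
library(randomForest)

set.seed(1)
# Synthetic stand-in for the real data: 2,000 rows, 10 numeric features.
n <- 2000
x <- as.data.frame(matrix(rnorm(n * 10), ncol = 10))
y <- x[[1]] + 0.5 * x[[2]]^2 + rnorm(n)

# Search over mtry using the out-of-bag (OOB) error.
tuned <- tuneRF(x, y,
                ntreeTry   = 50,    # trees grown for each candidate mtry
                stepFactor = 2,     # mtry is scaled up/down by this factor
                improve    = 0.01,  # relative OOB improvement needed to keep searching
                trace      = FALSE,
                plot       = FALSE)

# tuned is a matrix with one row per mtry tried and its OOB error.
best_mtry <- tuned[which.min(tuned[, "OOBError"]), "mtry"]
```

With only 10 features the search space for mtry is tiny, so this step is cheap compared with growing the final forest.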

The time it takes you will have to test yourself: it's going to be the product of folds × tuning × ntrees.
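One rough way to budget the time is to fit a small pilot forest, derive a per-tree cost, and multiply it out. The sketch below uses synthetic data, and the plan values (5 folds, 3 tuning values, 500 trees) are hypothetical. It assumes runtime grows roughly linearly with the number of trees, so the pilot should be timed on data of the same size as the real fit for the estimate to mean anything.

```r
library(randomForest)

set.seed(1)
# Synthetic stand-in for the real data.
n <- 2000
x <- as.data.frame(matrix(rnorm(n * 10), ncol = 10))
y <- x[[1]] + rnorm(n)

# Time a small pilot forest and derive a per-tree cost.
pilot_trees <- 20
elapsed <- system.time(
  randomForest(x, y, ntree = pilot_trees)
)[["elapsed"]]
secs_per_tree <- elapsed / pilot_trees

# Hypothetical tuning plan: folds x tuning values x trees per forest.
folds         <- 5
tuning_values <- 3
ntrees        <- 500
total_trees   <- folds * tuning_values * ntrees

# Rough estimate of total wall-clock time in seconds.
est_seconds <- secs_per_tree * total_trees
```

Working backwards from a fixed time window, dividing the window by secs_per_tree gives an approximate upper bound on the total number of trees that fit in it.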

The only speculative point I would add is that with 300,000 rows of data you might reduce the runtime without loss of predictive accuracy by bootstrapping a smaller sample of the data for each tree (the sampsize argument of randomForest)?
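A sketch of how that idea could be tried with the sampsize argument of randomForest, again on synthetic data. The sample size of 500 and the 100 trees are arbitrary assumptions. Whether accuracy actually holds up has to be checked on the real data, for example by comparing out-of-bag error as below.

```r
library(randomForest)

set.seed(1)
# Synthetic stand-in for the real data.
n <- 2000
x <- as.data.frame(matrix(rnorm(n * 10), ncol = 10))
y <- x[[1]] + 0.5 * x[[2]]^2 + rnorm(n)

# Default: each tree is grown on a bootstrap sample of all n rows.
rf_full <- randomForest(x, y, ntree = 100)

# Each tree is grown on a bootstrap sample of only 500 rows.
rf_small <- randomForest(x, y, ntree = 100, sampsize = 500, replace = TRUE)

# For regression, $mse holds the OOB mean squared error after each tree;
# the last element is the error of the complete forest.
oob_full  <- tail(rf_full$mse, 1)
oob_small <- tail(rf_small$mse, 1)
```

If oob_small is close to oob_full, the smaller per-tree sample buys more trees in the same time window at little cost in accuracy.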

Upvotes: 2
