Reputation: 351
I've been training randomForest models in R on 7 million rows of data (41 features). Here's an example call:
myModel <- randomForest(RESPONSE~., data=mydata, ntree=50, maxnodes=30)
I thought surely that with only 50 trees and 30 terminal nodes per tree, the memory footprint of "myModel" would be small. But it's 65 megs in a dump file. The object seems to be holding all sorts of predicted, actual, and vote data from the training process.
What if I just want the forest and that's it? I want a tiny dump file that I can load later to make predictions off of quickly. I feel like the forest by itself shouldn't be all that large...
Anyone know how to strip this sucker down to just something I can make predictions off of going forward?
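Concretely, most of that bulk is the per-row training output (predicted, y, oob.times and, for classification, votes), each 7 million entries long here. Below is a minimal sketch of dropping those by hand, assuming the usual component names shown by str(myModel); the safe-to-drop set isn't documented, so sanity-check predict() on the slimmed object before relying on it:
library(randomForest)
slim <- myModel
slim$predicted <- NULL   # per-row OOB predictions from training
slim$y         <- NULL   # copy of the 7-million-row response
slim$oob.times <- NULL   # how often each row was out-of-bag
slim$votes     <- NULL   # per-row OOB class votes (classification only)
slim$proximity <- NULL
slim$inbag     <- NULL
# keep slim$forest, slim$type, and slim$terms (the formula-interface fit needs terms);
# leave slim$importance in place, since predict() appears to use its names to match columns
print(object.size(slim), units = "MB")
# sanity check: predictions should be unchanged
stopifnot(identical(predict(slim, head(mydata)), predict(myModel, head(mydata))))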
Upvotes: 5
Views: 3216
Reputation: 56
You can use the tuneRF function in R to tune the forest (it searches for the best mtry using a trial forest of ntreeTry trees), which helps keep the model small:
tuneRF(data_train[, names(data_train) != "Response"], data_train$Response,
       stepFactor = 1.2, improve = 0.01, plot = TRUE, trace = TRUE)
Note that the predictors and the response are passed separately, so the Response column is excluded from the first argument. See ?tuneRF for more about its arguments.
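If it helps, here is a small sketch of feeding the result back into a compact forest. It assumes a data frame data_train with a Response column, as in the call above, and that tuneRF returns a matrix of the mtry values tried and their OOB errors (mtry in column 1, error in column 2):
library(randomForest)
predictors <- data_train[, names(data_train) != "Response"]
tr <- tuneRF(predictors, data_train$Response,
             stepFactor = 1.2, improve = 0.01, plot = TRUE, trace = TRUE)
# pick the mtry with the lowest out-of-bag error and refit a small forest
best_mtry <- tr[which.min(tr[, 2]), 1]
fit <- randomForest(predictors, data_train$Response,
                    mtry = best_mtry, ntree = 50, maxnodes = 30)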
Upvotes: 1
Reputation: 176648
Trying to get out of the habit of posting answers as comments...
?randomForest advises against using the formula interface with large numbers of variables... are the results any different if you don't use the formula interface? The Value section of ?randomForest also tells you how to turn off some of the output (importance matrix, the entire forest, proximity matrix, etc.).
For example:
myModel <- randomForest(mydata[, !grepl("RESPONSE", names(mydata))],
                        mydata$RESPONSE, ntree = 50, maxnodes = 30,
                        importance = FALSE, localImp = FALSE,
                        proximity = FALSE, keep.inbag = FALSE,
                        keep.forest = TRUE)  # keep the forest itself, or predict() won't work later
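And for the "tiny dump file I can load later" part of the question, saveRDS/readRDS on a fit like the one above (with keep.forest = TRUE) keeps things simple and compressed; the file name and newdata below are placeholders:
saveRDS(myModel, "rf_model.rds", compress = "xz")   # write just the model, compressed
# later, in a fresh session:
library(randomForest)                  # needed for the predict() method
myModel <- readRDS("rf_model.rds")
preds <- predict(myModel, newdata)     # newdata: a data frame with the same predictor columns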
Upvotes: 1