Reputation: 1673
I am training a randomForest model with the goal of saving it for prediction (it will be downloaded and used in an external context). I would like this model to be as small as possible.
I have read that there are a number of options and packages to reduce the memory size of the model.
Nevertheless, I don't understand why the size of the training set is tied to the size of the model. After all, once the fitted trees of the forest are there, why keep the original dataset?
df <- iris
model <- randomForest::randomForest(Species ~ ., data = df,
                                    localImp = FALSE,
                                    importance = FALSE,
                                    keep.forest = TRUE,
                                    keep.inbag = FALSE,
                                    proximity = FALSE,
                                    ntree = 25)
object.size(model)/1000
#> 73.2 bytes
df <- df[sample(nrow(df), 50), ]
model <- randomForest::randomForest(Species ~ ., data = df,
                                    localImp = FALSE,
                                    importance = FALSE,
                                    keep.forest = TRUE,
                                    keep.inbag = FALSE,
                                    proximity = FALSE,
                                    ntree = 25)
object.size(model)/1000
#> 43 bytes
Created on 2019-05-21 by the reprex package (v0.2.1)
I have tried the tricks mentioned above to reduce the size, but their effect is marginal compared to the effect of the training-set size. Is there a way to remove this information?
Upvotes: 1
Views: 746
Reputation: 8364
I think you can remove some parts of the model after you fit it:
object.size(model)/1000
# 70.4 bytes
model$predicted <- NULL # remove predicted
model$y <- NULL # remove y
#.. possibly other parts aren't needed
object.size(model)/1000
# 48.3 bytes
I checked with predict(model, df) to see if it'd still work, and it does.
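In case it helps, here is a rough sketch of that check (keeping an untouched copy of the fit around to compare against; the names full_fit and small_fit are just for illustration):
# Sanity check (sketch): strip a copy of the fit and confirm predictions match.
full_fit  <- randomForest::randomForest(Species ~ ., data = iris, ntree = 25)
small_fit <- full_fit
small_fit$predicted <- NULL   # components we plan to drop
small_fit$y <- NULL
identical(predict(full_fit, iris), predict(small_fit, iris))  # expected TRUE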
Use names(model) to check the elements inside model.
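For example, this little sketch (not part of the original answer) shows which elements of the fit take up the most space:
# Size of each component of the fitted model, largest first
sizes <- sapply(model, function(x) as.numeric(utils::object.size(x)))
sort(sizes, decreasing = TRUE)  # the biggest components are the candidates to drop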
It seems that $votes is big and you don't need it; here are more items I safely removed:
model$predicted <- NULL
model$y <- NULL
model$err.rate <- NULL
model$test <- NULL
model$proximity <- NULL
model$confusion <- NULL
model$localImportance <- NULL
model$importanceSD <- NULL
model$inbag <- NULL
model$votes <- NULL
model$oob.times <- NULL
object.size(model)/1000
# 32.3 bytes
Example:
df <- iris
model <- randomForest::randomForest(Species ~ ., data = df,
                                    localImp = FALSE,
                                    importance = FALSE,
                                    keep.forest = TRUE,
                                    keep.inbag = FALSE,
                                    proximity = FALSE,
                                    ntree = 25)
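And since the model will be downloaded and used elsewhere, what ultimately matters is the file size; one option (my suggestion, the file name is just an example) is to save the slimmed-down model with compression:
saveRDS(model, "rf_model.rds", compress = "xz")  # compressed on disk
file.size("rf_model.rds") / 1000                 # size in kB
# later, in the external context:
library(randomForest)   # needed so predict() dispatches to predict.randomForest
model <- readRDS("rf_model.rds")
predict(model, newdata = iris[1:5, ])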
Upvotes: 1