Dominique Makowski

Reputation: 1673

randomForest model size depends on the training set size: A way to avoid?

I am training a randomForest model with the goal of saving it for prediction (it will be downloaded and used in an external context). I would like this model to be as small as possible.

I have read that there are a number of options and packages to reduce the memory size of the model.

Nevertheless, I don't understand why the size of the model is tied to the size of the training set. After all, once the trees of the forest have been grown, why does the original dataset need to be kept?

df <- iris
model <- randomForest::randomForest(Species ~ ., data = df, 
                 localImp = FALSE,
                 importance = FALSE,
                 keep.forest = TRUE,
                 keep.inbag = FALSE,
                 proximity = FALSE,
                 ntree = 25)
object.size(model)/1000
#> 73.2 bytes

df <- df[sample(nrow(df), 50), ]
model <- randomForest::randomForest(Species ~ ., data = df, 
                 localImp = FALSE,
                 importance = FALSE,
                 keep.forest = TRUE,
                 keep.inbag = FALSE,
                 proximity = FALSE,
                 ntree = 25)
object.size(model)/1000
#> 43 bytes

Created on 2019-05-21 by the reprex package (v0.2.1)

I have tried the tricks mentioned above to reduce the size, but their effect is marginal compared to that of the training-set size. Is there a way to remove this information from the fitted model?

Upvotes: 1

Views: 746

Answers (1)

RLave

Reputation: 8364

I think you can remove some parts of the model after fitting it:

object.size(model)/1000
# 70.4 bytes

model$predicted <- NULL # remove predicted
model$y <- NULL # remove y
#.. possibly other parts aren't needed
object.size(model)/1000
# 48.3 bytes

I checked with predict(model, df) to see if it'd still work, and it does.

Use names(model) to check the elements inside model.
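To see which elements actually take up the space, you can also rank the components by size (a quick sketch on top of the above):

# Size of each component in KB; the big ones (votes, predicted, oob.times, y)
# have one entry per training row, which is why the model grows with the data
sort(sapply(model, object.size), decreasing = TRUE) / 1000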

It seems that $votes is big and you don't need it. Here are more items I removed safely:

model$predicted <- NULL
model$y <- NULL
model$err.rate <- NULL
model$test <- NULL
model$proximity <- NULL
model$confusion <- NULL
model$localImportance <- NULL
model$importanceSD <- NULL
model$inbag <- NULL
model$votes <- NULL
model$oob.times <- NULL


object.size(model)/1000
# 32.3 bytes
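Since the goal is to save the model and use it in another context, you can also write the stripped model to disk with stronger compression; a small sketch (the file name is just an example):

# Serialize the stripped model with xz compression for distribution
saveRDS(model, "rf_model.rds", compress = "xz")
file.size("rf_model.rds") / 1000

# Later, in the external context:
model <- readRDS("rf_model.rds")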

Example:

df <- iris
model <- randomForest::randomForest(Species ~ ., data = df, 
                 localImp = FALSE,
                 importance = FALSE,
                 keep.forest = TRUE,
                 keep.inbag = FALSE,
                 proximity = FALSE,
                 ntree = 25)
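If you do this for every fit, the removals can be wrapped in a small helper; a minimal sketch (strip_rf is a made-up name, not part of randomForest):

# Drop the per-observation components listed above; $forest is kept for prediction
strip_rf <- function(m) {
  drop <- c("predicted", "y", "err.rate", "test", "proximity", "confusion",
            "localImportance", "importanceSD", "inbag", "votes", "oob.times")
  for (nm in drop) m[[nm]] <- NULL
  m
}

small <- strip_rf(model)
object.size(small) / 1000
head(predict(small, df))  # the stripped model still predicts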

Upvotes: 1
