Oscar Kjell

Reputation: 1651

Avoid larger (bloated) size when saving (regression) model in R (environments)

I want to fit a regression model inside another function, but when I save the model the file becomes very large because other data in the function's environment are saved along with it. I think the solution is to control which environment the model refers to; this helped me understand the issue better. Below I explain the problem in a few steps.

# Helper function just to quickly assess how big the object becomes when being saved.
saveSize <- function (object) {
  tf <- tempfile(fileext = ".RData")
  on.exit(unlink(tf))
  save(object, file = tf)
  file.size(tf)
}

# Subset of rows (observations) to be used
subset <- 1:4

# Model size to compare with; i.e., not created within a function
model1 <- lm(Sepal.Length ~ Sepal.Width, data = iris, subset = subset)
saveSize(model1)
# Size = 965

# Function where there are other data that should NOT be saved. 
Function2 <- function (subset){
  data_not_to_be_saved <- 1:1e+15
  model2 <- lm(Sepal.Length ~ Sepal.Width, data = iris, subset = subset)
}
model2 <- Function2(subset)
saveSize(model2) 
# Size = 1148 ; Problematic that the size is larger than for model1.

# Solution to above is to create a new environment
Function3 <- function (subset){
  data_not_to_be_saved <- 1:1e+15
  # New environment
  env <- new.env(parent = globalenv())
  env$subset <- subset
  with(env, lm(Sepal.Length ~ Sepal.Width, data = iris, subset = subset))
}
model3 <- Function3(subset)
saveSize(model3) 
# 1002 # Success: considerably smaller than in Function 2. 



# PROBLEM: Getting solution in Function 3 to work within another function. 

# This function runs, but the result is a large object again.
# Also note that I do not want to refer to the iris dataset directly within the lm call.
Function5 <- function (subset){
  
  data_not_to_be_saved <- 1:1e+15
  
  Function5 <- function (subset) {
    
    env <- new.env(parent = globalenv())
    env$subset <- subset
    env$datainenvorment <- iris
    
    with(env, lm(Sepal.Length ~ Sepal.Width, data = datainenvorment, subset = subset))
  }
  model5 <- Function5(subset)
}

model5 <- Function5(subset)
saveSize(model5) 

Thanks in advance

Upvotes: 1

Views: 114

Answers (1)

polkas

Reputation: 4184

The solution you are using works correctly. You do not see a large difference because, in recent R versions, regular sequences created with `:` are stored very compactly. The small remaining difference comes from the overhead of the extra variables held in env. The important point is that data_not_to_be_saved is no longer saved with the model.
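To see why the placeholder vector costs almost nothing, the serialized sizes can be compared directly. This is a minimal sketch, assuming a recent R version in which compact sequences are also serialized compactly (exact byte counts vary by version); `serialized_bytes` is a hypothetical helper, not part of the question's code.

```r
# Hypothetical helper: number of bytes an object occupies once serialized,
# which is roughly what save() writes before compression.
serialized_bytes <- function(object) length(serialize(object, NULL))

compact_bytes <- serialized_bytes(1:1e6)       # regular sequence made with `:`
dense_bytes   <- serialized_bytes(rnorm(1e6))  # one million arbitrary doubles

cat("sequence:", compact_bytes, "bytes\n")
cat("random:  ", dense_bytes, "bytes\n")
```

If the sequence is reported as far smaller than the random vector, a test based on `1:1e+15` cannot reveal whether the unwanted variable is being saved.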

Use some bigger data to see it more clearly.

data_not_to_be_saved <- rnorm(10^5)
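With that change, the difference between the two approaches from the question becomes visible. The sketch below mirrors the question's two functions under hypothetical names (`fit_leaky`, `fit_clean`, `serialized_bytes`) and measures the serialized size instead of writing a file; it is an illustration, not a benchmark.

```r
# Hypothetical helper: serialized size in bytes (uncompressed).
serialized_bytes <- function(object) length(serialize(object, NULL))

# Fits the model directly in the function body, so the formula keeps
# a reference to the frame that also holds data_not_to_be_saved.
fit_leaky <- function(rows) {
  data_not_to_be_saved <- rnorm(10^5)
  lm(Sepal.Length ~ Sepal.Width, data = iris, subset = rows)
}

# Fits the model in a fresh environment that contains only `rows`.
fit_clean <- function(rows) {
  data_not_to_be_saved <- rnorm(10^5)
  env <- new.env(parent = globalenv())
  env$rows <- rows
  with(env, lm(Sepal.Length ~ Sepal.Width, data = iris, subset = rows))
}

size_leaky <- serialized_bytes(fit_leaky(1:4))
size_clean <- serialized_bytes(fit_clean(1:4))
cat("leaky:", size_leaky, "bytes; clean:", size_clean, "bytes\n")
```

The leaky version has to carry the 10^5 doubles (about 800 kB) with it, while the clean version stays close to the size of the model itself.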

What is the source of the problem? lm returns an object that holds references to other environments (for example, anything created inside a function can keep a reference to that function's environment, which gives access to every variable defined there). In addition, save, with its default arguments, follows those references and writes out the variables stored in the referenced environments.

str(model5)
# The output includes a line like:
#   .. .. ..- attr(*, ".Environment")=<environment: 0x7fdc9e6c2b68>
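The captured environment can also be inspected directly, which shows exactly which variables would travel with the model. A small sketch, assuming a model fitted inside a function; `fit_in_function`, `rows`, and `unrelated` are illustrative names that do not appear in the question.

```r
# Illustrative function: `unrelated` stands in for data_not_to_be_saved.
fit_in_function <- function(rows) {
  unrelated <- rnorm(10^5)
  lm(Sepal.Length ~ Sepal.Width, data = iris, subset = rows)
}

fit <- fit_in_function(1:4)

# The terms object carries the environment in which the formula was created.
captured <- attr(terms(fit), ".Environment")
captured_names <- ls(captured)
print(captured_names)
```

Every name listed here lives in an environment that the model references, so it is written to disk when the model is saved.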

Another solution might be to use the lm.fit function, which returns only base structures (a plain list). In that case no additional environment references are carried along.

model_fit <- lm.fit(cbind(1,iris$Sepal.Width[subset]), iris$Sepal.Length[subset])
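As a quick check that this gives the same estimates as lm, the coefficients can be compared. This is a sketch only; the column names given to the design matrix and the variable names (`rows`, `X`, `y`, `fit_raw`, `fit_lm`) are chosen here for readability.

```r
rows <- 1:4

# Design matrix with an explicit intercept column, as lm would build it.
X <- cbind(Intercept = 1, Sepal.Width = iris$Sepal.Width[rows])
y <- iris$Sepal.Length[rows]

fit_raw <- lm.fit(X, y)   # plain list, no formula or terms attached
fit_lm  <- lm(Sepal.Length ~ Sepal.Width, data = iris, subset = rows)

same_coefficients <- isTRUE(all.equal(
  unname(fit_raw$coefficients),
  unname(coef(fit_lm))
))
print(same_coefficients)
```

The trade-off is that the result of lm.fit is not an object of class "lm", so methods such as summary() and predict() are not available for it without extra work.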

Upvotes: 1
