I want to fit a regression model inside another function, but when I save the model the file becomes very large because other data in the function's environment is saved along with it. I think the solution is to handle the environments explicitly; this helped me understand this better. Below I explain the problem in a few steps.
# Helper function just to quickly assess how big the object becomes when being saved.
saveSize <- function(object) {
  tf <- tempfile(fileext = ".RData")
  on.exit(unlink(tf))
  save(object, file = tf)
  file.size(tf)
}
# Subset of rows (observations) to be used
subset <- 1:4
# Model size to compare with; i.e., not created within a function
model1 <- lm(Sepal.Length ~ Sepal.Width, data = iris, subset = subset)
saveSize(model1)
# Size = 965
# Function where there are other data that should NOT be saved.
Function2 <- function(subset) {
  data_not_to_be_saved <- 1:1e+15
  model2 <- lm(Sepal.Length ~ Sepal.Width, data = iris, subset = subset)
}
model2 <- Function2(subset)
saveSize(model2)
# Size = 1148 ; Problematic that the size is larger than for model1.
# Solution to above is to create a new environment
Function3 <- function(subset) {
  data_not_to_be_saved <- 1:1e+15
  # New environment
  env <- new.env(parent = globalenv())
  env$subset <- subset
  with(env, lm(Sepal.Length ~ Sepal.Width, data = iris, subset = subset))
}
model3 <- Function3(subset)
saveSize(model3)
# 1002 # Success: considerably smaller than in Function 2.
# PROBLEM: Getting the solution from Function3 to work within another function.
# This function runs but results in a large object again.
# Also note that I do not want to refer to the iris dataset directly within the lm call.
Function5 <- function(subset) {
  data_not_to_be_saved <- 1:1e+15
  # Inner helper; it shadows the outer Function5 inside this scope
  Function5 <- function(subset) {
    env <- new.env(parent = globalenv())
    env$subset <- subset
    env$datainenvorment <- iris
    with(env, lm(Sepal.Length ~ Sepal.Width, data = datainenvorment, subset = subset))
  }
  model5 <- Function5(subset)
}
model5 <- Function5(subset)
saveSize(model5)
Thanks in advance
Reputation: 4184
The solution you are using works correctly. You do not see the effect because in recent R versions sequences such as 1:1e+15 are stored very compactly, so they add almost nothing to the saved file. The small size differences come from the overhead of the additional variables, such as env. Most importantly, the data_not_to_be_saved variable is skipped.
Use some bigger data to see it more clearly.
data_not_to_be_saved <- rnorm(10**5)
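To make the difference visible, here is a minimal sketch that repeats the comparison with non-compact data. The names ser_size, fit_plain, fit_isolated and rows are made up for this illustration, and the serialized size in memory is used as a stand-in for the size of the file written by save (an assumption; the saved file is additionally compressed).

```r
# Hypothetical helper: size in bytes of the serialized object
ser_size <- function(object) length(serialize(object, NULL))

# Formula is created in the function's own environment,
# so the large local vector travels with the model
fit_plain <- function(rows) {
  data_not_to_be_saved <- rnorm(1e5)
  lm(Sepal.Length ~ Sepal.Width, data = iris, subset = rows)
}

# Formula is created in a small dedicated environment,
# so only `rows` travels with the model
fit_isolated <- function(rows) {
  data_not_to_be_saved <- rnorm(1e5)
  env <- new.env(parent = globalenv())
  env$rows <- rows
  with(env, lm(Sepal.Length ~ Sepal.Width, data = iris, subset = rows))
}

size_plain    <- ser_size(fit_plain(1:4))
size_isolated <- ser_size(fit_isolated(1:4))
c(plain = size_plain, isolated = size_isolated)
```

With 1e5 random doubles the first model should carry roughly 800 kB of extra data, while the second should stay in the range of a few kilobytes.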
What is the source of this problem? lm returns an object that holds references to environments: the terms (formula) keep a reference to the environment in which the formula was created, and that environment gives access to every variable defined there. When save writes the object, it also writes those referenced environments together with their contents (the global environment and package namespaces are only stored as references).
str(model5)
# like .. .. ..- attr(*, ".Environment")=<environment: 0x7fdc9e6c2b68>
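The captured environment can also be inspected directly. This is a small sketch, with fit_in_function and big_local as hypothetical names, showing that the environment attached to the model terms is the one of the function that created the formula.

```r
# Hypothetical function with one local variable besides the model
fit_in_function <- function() {
  big_local <- rnorm(10)
  lm(Sepal.Length ~ Sepal.Width, data = iris)
}

m <- fit_in_function()

# The same attribute that str() shows as ".Environment"
captured <- attr(terms(m), ".Environment")

# Lists the variables that would be saved together with the model
ls(captured)
```

Anything listed by ls(captured) is written out together with the model when it is saved.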
Another solution might be to use the lm.fit function, which returns only a plain list of base structures, so no reference to any environment is kept.
model_fit <- lm.fit(cbind(1,iris$Sepal.Width[subset]), iris$Sepal.Length[subset])
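As a quick check, the sketch below compares the lm.fit coefficients with those from lm on the same rows. The names rows, X, y, fit_raw and fit_lm are chosen only for this illustration.

```r
rows <- 1:4

# Design matrix with an explicit intercept column, and the response vector
X <- cbind(Intercept = 1, Sepal.Width = iris$Sepal.Width[rows])
y <- iris$Sepal.Length[rows]

fit_raw <- lm.fit(X, y)
fit_lm  <- lm(Sepal.Length ~ Sepal.Width, data = iris, subset = rows)

# Same estimates, but fit_raw is a plain list rather than an "lm" object
same_coefs <- isTRUE(all.equal(unname(fit_raw$coefficients),
                               unname(coef(fit_lm))))
same_coefs
inherits(fit_raw, "lm")
```

The trade-off is that the result of lm.fit is not of class "lm", so methods such as summary and predict for lm objects do not apply to it directly.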