aaron

Reputation: 6489

Logistic regression objects consuming an enormous amount of disk space after R save() function

I haven't supplied sample code because so far I have been unable to replicate the problem on a smaller data set. I am training several logistic regressions (50 in this example) using different covariate selections and saving the output as a list. My training data has 400K+ rows.

Recognizing that there is a large amount of unnecessary background data that gets stored in glm objects, my training script involves the following lines of code, which are intended to strip out as much extra data as I can and reduce the memory footprint of the output object:

# Drop the large per-observation components that glm() keeps by default
fit[c('residuals', 'fitted.values', 'effects', 'weights', 'prior.weights',
      'y', 'linear.predictors', 'data')] <- NULL
fit$qr$qr <- NULL  # the raw QR matrix has one row per training observation
gc()
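
(One way to see what save() will actually write, as opposed to what object.size() reports: serialize() produces essentially the same byte stream that save() then compresses, so measuring it on a single stripped fit gives the true serialized size:)

length(serialize(fit, NULL))  # bytes save() would write for this fit, pre-compression
object.size(fit)              # the shallow in-memory measure, for comparison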

At first this seemed to work fine: after executing my code, the R/RStudio console tells me the list of glms is only 9.6 MB.

However, when I save this object using save(logitFire, file = 'logitFire.RData'), I find that its footprint on disk is absolutely massive: 1.32 GB.

Again, I 100% recognize that it is bad form not to supply a toy example. I tried using the iris dataset and was unable to reproduce the problem; it seems to be a feature of large data sets, but I'm not sure. Do any experts out there have an idea of what's going on? My next step, if I can't solve this with functions from the base package, will be to write my own "leanLogit" and "leanPredict" wrappers that pull out just the model coefficients and toss all the rest of the ancillary data (a sketch of that idea follows below).
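
For what it's worth, here is a minimal sketch of what those wrappers might look like. The names leanLogit/leanPredict are the ones proposed above; the code is untested against this data and assumes a full-rank fit with no NA coefficients. One detail worth flagging: the formula and terms components carry an .Environment attribute (visible in the str() dump below), and save() serializes everything reachable through a non-global environment, so the sketch repoints terms at baseenv():

leanLogit <- function(formula, data) {
  fit <- glm(formula, family = binomial(), data = data,
             model = FALSE, x = FALSE, y = FALSE)
  lean <- list(coefficients = coef(fit),
               terms        = delete.response(terms(fit)),
               xlevels      = fit$xlevels,
               contrasts    = fit$contrasts)
  # Detach the terms from the fitting environment so save() can't
  # drag the training data along with the model object
  environment(lean$terms) <- baseenv()
  lean
}

leanPredict <- function(lean, newdata) {
  mf <- model.frame(lean$terms, newdata, xlev = lean$xlevels)
  mm <- model.matrix(lean$terms, mf, contrasts.arg = lean$contrasts)
  plogis(drop(mm %*% lean$coefficients))  # inverse logit of the linear predictor
}

This returns predicted probabilities only; everything else predict.glm provides (link-scale predictions, standard errors) is deliberately out of scope for the sketch.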

EDIT: To clarify, the code snippet that I included in this question is embedded within the model training routine that ultimately produces the list of glms, logitFire. It is not complete code; I included it so that readers can see which data objects I'm stripping out.

EDIT #2: Here's the additional requested info. To be as clear as I possibly can: logitFire is a list of 50 logistic regression models that I produced in R using glm(). Below is object.size() on the full list, followed by the output of str() on one element of that list, i.e. one logistic regression model:

> object.size(logitFire)
10113640 bytes
> str(logitFire[[1]])
List of 21
 $ coefficients : Named num [1:54] 18.361 -0.592 -1.043 -0.744 0.101 ...
  ..- attr(*, "names")= chr [1:54] "(Intercept)" "var32" "var33" "var34" ...
 $ R            : num [1:54, 1:54] -11.3 0 0 0 0 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:54] "(Intercept)" "var32" "var33" "var34" ...
  .. ..$ : chr [1:54] "(Intercept)" "var32" "var33" "var34" ...
 $ rank         : int 53
 $ qr           :List of 4
  ..$ rank : int 53
  ..$ qraux: num [1:54] 1 1 1 1 1 ...
  ..$ pivot: int [1:54] 1 2 3 4 5 6 7 8 9 10 ...
  ..$ tol  : num 1e-11
  ..- attr(*, "class")= chr "qr"
 $ family       :List of 12
  ..$ family    : chr "binomial"
  ..$ link      : chr "logit"
  ..$ linkfun   :function (mu)  
  ..$ linkinv   :function (eta)  
  ..$ variance  :function (mu)  
  ..$ dev.resids:function (y, mu, wt)  
  ..$ aic       :function (y, n, mu, wt, dev)  
  ..$ mu.eta    :function (eta)  
  ..$ initialize:  expression({     if (NCOL(y) == 1) {         if (is.factor(y))                 y <- y != levels(y)[1L]         n <- rep.int(1, nobs)         y[weights == 0] <- 0         if (any(y < 0 | y > 1))              stop("y values must be 0 <= y <= 1")         mustart <- (weights * y + 0.5)/(weights + 1)         m <- weights * y         if (any(abs(m - round(m)) > 0.001))              warning("non-integer #successes in a binomial glm!")     }     else if (NCOL(y) == 2) {         if (any(abs(y - round(y)) > 0.001))              warning("non-integer counts in a binomial glm!")         n <- y[, 1] + y[, 2]         y <- ifelse(n == 0, 0, y[, 1]/n)         weights <- weights * n         mustart <- (n * y + 0.5)/(n + 1)     }     else stop("for the 'binomial' family, y must be a vector of 0 and 1's\nor a 2 column matrix where col 1 is no. successes and col 2 is no. failures") })
  ..$ validmu   :function (mu)  
  ..$ valideta  :function (eta)  
  ..$ simulate  :function (object, nsim)  
  ..- attr(*, "class")= chr "family"
 $ deviance     : num 1648
 $ aic          : num 1754
 $ null.deviance: num 1783
 $ iter         : int 19
 $ df.residual  : int 49947
 $ df.null      : int 49999
 $ converged    : logi TRUE
 $ boundary     : logi FALSE
 $ call         : language glm(formula = fire ~ var3 + var1 + var12isNA + var4 + var11 + var13 + var6 + dummy + var9 + var16isNA + var10 + var17 + var8 + var7 + var15isNA +      var14isNA, family = binomial(), data = inData, model = FALSE)
 $ formula      :Class 'formula' length 3 fire ~ var3 + var1 + var12isNA + var4 + var11 + var13 + var6 + dummy + var9 + var16isNA + var10 + var17 + var8 + var7 + var15isNA + var14isNA
  .. ..- attr(*, ".Environment")=<environment: 0x7f98f5685ce8> 
 $ terms        :Classes 'terms', 'formula' length 3 fire ~ var3 + var1 + var12isNA + var4 + var11 + var13 + var6 + dummy + var9 + var16isNA + var10 + var17 + var8 + var7 + var15isNA + var14isNA
  .. ..- attr(*, "variables")= language list(fire, var3, var1, var12isNA, var4, var11, var13, var6, dummy, var9, var16isNA, var10, var17, var8, var7, var15isNA, var14isNA)
  .. ..- attr(*, "factors")= int [1:17, 1:16] 0 1 0 0 0 0 0 0 0 0 ...
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:17] "fire" "var3" "var1" "var12isNA" ...
  .. .. .. ..$ : chr [1:16] "var3" "var1" "var12isNA" "var4" ...
  .. ..- attr(*, "term.labels")= chr [1:16] "var3" "var1" "var12isNA" "var4" ...
  .. ..- attr(*, "order")= int [1:16] 1 1 1 1 1 1 1 1 1 1 ...
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: 0x7f98f5685ce8> 
  .. ..- attr(*, "predvars")= language list(fire, var3, var1, var12isNA, var4, var11, var13, var6, dummy, var9, var16isNA, var10, var17, var8, var7, var15isNA, var14isNA)
  .. ..- attr(*, "dataClasses")= Named chr [1:17] "numeric" "factor" "factor" "logical" ...
  .. .. ..- attr(*, "names")= chr [1:17] "fire" "var3" "var1" "var12isNA" ...
 $ offset       : NULL
 $ control      :List of 3
  ..$ epsilon: num 1e-08
  ..$ maxit  : num 25
  ..$ trace  : logi FALSE
 $ method       : chr "glm.fit"
 $ contrasts    :List of 12
  ..$ var3     : chr "contr.treatment"
  ..$ var1     : chr "contr.treatment"
  ..$ var12isNA: chr "contr.treatment"
  ..$ var4     : chr "contr.treatment"
  ..$ var6     : chr "contr.treatment"
  ..$ dummy    : chr "contr.treatment"
  ..$ var9     : chr "contr.treatment"
  ..$ var16isNA: chr "contr.treatment"
  ..$ var8     : chr "contr.treatment"
  ..$ var7     : chr "contr.treatment"
  ..$ var15isNA: chr "contr.treatment"
  ..$ var14isNA: chr "contr.treatment"
 $ xlevels      :List of 8
  ..$ var3 : chr [1:7] "1" "2" "3" "4" ...
  ..$ var1 : chr [1:6] "1" "2" "3" "4" ...
  ..$ var4 : chr [1:15] "99" "A1" "C1" "D1" ...
  ..$ var6 : chr [1:4] "A" "B" "C" "Z"
  ..$ dummy: chr [1:2] "A" "B"
  ..$ var9 : chr [1:3] "A" "B" "Z"
  ..$ var8 : chr [1:7] "1" "2" "3" "4" ...
  ..$ var7 : chr [1:9] "1" "2" "3" "4" ...
 - attr(*, "class")= chr [1:2] "glm" "lm"
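
(Side note on the dump above: both formula and terms carry an .Environment attribute, and save() serializes everything reachable through a non-global environment, which object.size() does not count. A quick check, assuming that fitting environment still exists, of whether the training data is hiding there:)

ls(attr(logitFire[[1]]$terms, ".Environment"))  # does inData (the 400K-row training set) appear here?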

Upvotes: 1

Views: 666

Answers (1)

smci

Reputation: 33970

You're describing ~135x bloat after save() to disk. Here are some tips, without seeing your data:

  1. object.size() is only a shallow measure of memory usage: it doesn't follow pointers, so anything reachable only by reference (notably environments, shared strings and factor labels) doesn't get counted. It can therefore be a severe undercount in some cases. Instead, use the excellent lsos() memory-reporting function and tell us what it reports.
  2. Does your environment contain lots of large strings (e.g. text or genomics data), high-cardinality factors whose labels got converted to strings, or unnecessary string row labels? Check that you set options(stringsAsFactors = FALSE) globally and pass stringsAsFactors = FALSE to read.csv(). Also make sure you never explicitly call data.frame(..., stringsAsFactors = TRUE). (But none of that seems to be the case from your str() dump.)
  3. Instead of just telling us object.size() or lsos() numbers, also tell us the total memory size of your R session both before and after creating those logistic-regression objects, and before and after running garbage collection (gc(); gc(reset = TRUE)). Then, after the save(), rm() the contents of logitFire and then logitFire itself, run garbage collection again, and report R's total memory usage and how much it shrank by (see the sketch after this list).
  4. Check what other variables/dataframes/data.tables/file objects/models are in your environment: run ls() and look at their sizes with lsos() (as @BenBolker hints).
  5. Test it under plain R, not RStudio! RStudio occasionally causes memory blowup or interferes with the gc (e.g. if you accidentally keep references to the data in a View() window).
  6. Check both your ./.Rprofile and ~/.Rprofile for wacky settings that may have crept in, even innocuous-looking ones like loading unnecessary packages, or loading them in a different order (which can shadow a builtin function). Have you reproduced this behavior from another clean environment/VM/coworker's login? If not, do.
  7. Make sure you start from a clean environment and don't import a .RData, especially one littered with lots of large temporaries. Use R --no-restore --no-save; in RStudio, also disable the General settings 'Restore RData' and 'Save RData'. Delete any .RData files you accidentally leave lying around.
  8. Some obvious, stupid things to check about the default behavior of save(): its defaults are ascii = FALSE, envir = parent.frame(), compress = isTRUE(!ascii), and compression_level (per the docs, "integer: the level of compression to be used. Defaults to 6 for gzip compression and to 9 for bzip2 or xz compression"). Check that nothing is messing with your defaults, and, as @BenBolker says, check what is in your environment (do ls() and look at the sizes; see the sketch after this list).
  9. Failing all that, you could always try a manual gzip on your .RData to see how bloated it is.
  10. Along with all of the above, try to cut your commands and parameters down to the minimum at which the issue is still reproducible, using randomly generated data with a fixed seed. Ideally we can get to a testcase :) Or a resolution.
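
A minimal sketch of the bookkeeping suggested in point 3 (trainModels() is a hypothetical placeholder for your training routine):

gc(reset = TRUE); gc()        # baseline after a full collection
logitFire <- trainModels()    # hypothetical: builds the list of 50 glms
gc()                          # how much did the session grow?
save(logitFire, file = 'logitFire.RData')
rm(logitFire)
gc()                          # how much is reclaimed after the rm()?

And for point 8, spelling out save()'s documented defaults explicitly, so it is obvious if anything in your session has overridden them:

save(logitFire, file = 'logitFire.RData',
     ascii = FALSE, compress = TRUE, compression_level = 6)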

Upvotes: 1
