aaron

Reputation: 6489

Logistic regression objects consuming an enormous amount of disk space after R save() function

I haven't supplied sample code because so far I have been unable to replicate the problem on a smaller data set. I am training several logistic regressions (50 in this example) using different covariate selections and saving the output as a list. My training data has 400K+ rows.

Recognizing that there is a large amount of unnecessary background data that gets stored in glm objects, my training script involves the following lines of code, which are intended to strip out as much extra data as I can and reduce the memory footprint of the output object:

# Drop the large per-observation components that glm() keeps by default
fit[c('residuals', 'fitted.values', 'effects', 'weights', 'prior.weights',
      'y', 'linear.predictors', 'data')] <- NULL
fit$qr$qr <- NULL  # the raw QR matrix has one row per training observation
gc()
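
(One way to see what save() will actually write, as opposed to what object.size() reports: serialize() produces essentially the same byte stream that save() then compresses, so measuring it on a single stripped fit gives the true serialized size:)

length(serialize(fit, NULL))  # bytes save() would write for this fit, pre-compression
object.size(fit)              # the shallow in-memory measure, for comparison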

At first this seemed to work fine: after executing my code, the R/RStudio console tells me the list of glms is only 9.6 MB.

However, when I save this object using save(logitFire, file = 'logitFire.RData'), I find that its footprint on disk is absolutely massive: 1.32 GB.

Again, I 100% recognize that it is bad form not to supply a toy example. I tried using the iris dataset and was unable to reproduce the problem; it seems to be a feature of large data sets, but I'm not sure. Do any experts out there have an idea of what's going on? My next step, if I can't solve this with functions from the base package, will be to write my own "leanLogit" and "leanPredict" wrappers that pull out just the model coefficients and toss all the rest of the ancillary data (a sketch of that idea follows below).
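
For what it's worth, here is a minimal sketch of what those wrappers might look like. The names leanLogit/leanPredict are the ones proposed above; the code is untested against this data and assumes a full-rank fit with no NA coefficients. One detail worth flagging: the formula and terms components carry an .Environment attribute (visible in the str() dump below), and save() serializes everything reachable through a non-global environment, so the sketch repoints terms at baseenv():

leanLogit <- function(formula, data) {
  fit <- glm(formula, family = binomial(), data = data,
             model = FALSE, x = FALSE, y = FALSE)
  lean <- list(coefficients = coef(fit),
               terms        = delete.response(terms(fit)),
               xlevels      = fit$xlevels,
               contrasts    = fit$contrasts)
  # Detach the terms from the fitting environment so save() can't
  # drag the training data along with the model object
  environment(lean$terms) <- baseenv()
  lean
}

leanPredict <- function(lean, newdata) {
  mf <- model.frame(lean$terms, newdata, xlev = lean$xlevels)
  mm <- model.matrix(lean$terms, mf, contrasts.arg = lean$contrasts)
  plogis(drop(mm %*% lean$coefficients))  # inverse logit of the linear predictor
}

This returns predicted probabilities only; everything else predict.glm provides (link-scale predictions, standard errors) is deliberately out of scope for the sketch.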

EDIT: To clarify, the code snippet that I included in this question is embedded within the model training routine that ultimately produces the list of glms, logitFire. It is not complete code; I included it so that readers can see which data objects I'm stripping out.

EDIT #2: Here's the additional requested info. To be as clear as I possibly can: logitFire is a list of 50 logistic regression models that I produced in R using glm(). Below is object.size() on the full list, followed by the output of str() on one element of that list, i.e. one logistic regression model:

> object.size(logitFire)
10113640 bytes
> str(logitFire[[1]])
List of 21
 $ coefficients : Named num [1:54] 18.361 -0.592 -1.043 -0.744 0.101 ...
  ..- attr(*, "names")= chr [1:54] "(Intercept)" "var32" "var33" "var34" ...
 $ R            : num [1:54, 1:54] -11.3 0 0 0 0 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:54] "(Intercept)" "var32" "var33" "var34" ...
  .. ..$ : chr [1:54] "(Intercept)" "var32" "var33" "var34" ...
 $ rank         : int 53
 $ qr           :List of 4
  ..$ rank : int 53
  ..$ qraux: num [1:54] 1 1 1 1 1 ...
  ..$ pivot: int [1:54] 1 2 3 4 5 6 7 8 9 10 ...
  ..$ tol  : num 1e-11
  ..- attr(*, "class")= chr "qr"
 $ family       :List of 12
  ..$ family    : chr "binomial"
  ..$ link      : chr "logit"
  ..$ linkfun   :function (mu)  
  ..$ linkinv   :function (eta)  
  ..$ variance  :function (mu)  
  ..$ dev.resids:function (y, mu, wt)  
  ..$ aic       :function (y, n, mu, wt, dev)  
  ..$ mu.eta    :function (eta)  
  ..$ initialize:  expression({     if (NCOL(y) == 1) {         if (is.factor(y))                 y <- y != levels(y)[1L]         n <- rep.int(1, nobs)         y[weights == 0] <- 0         if (any(y < 0 | y > 1))              stop("y values must be 0 <= y <= 1")         mustart <- (weights * y + 0.5)/(weights + 1)         m <- weights * y         if (any(abs(m - round(m)) > 0.001))              warning("non-integer #successes in a binomial glm!")     }     else if (NCOL(y) == 2) {         if (any(abs(y - round(y)) > 0.001))              warning("non-integer counts in a binomial glm!")         n <- y[, 1] + y[, 2]         y <- ifelse(n == 0, 0, y[, 1]/n)         weights <- weights * n         mustart <- (n * y + 0.5)/(n + 1)     }     else stop("for the 'binomial' family, y must be a vector of 0 and 1's\nor a 2 column matrix where col 1 is no. successes and col 2 is no. failures") })
  ..$ validmu   :function (mu)  
  ..$ valideta  :function (eta)  
  ..$ simulate  :function (object, nsim)  
  ..- attr(*, "class")= chr "family"
 $ deviance     : num 1648
 $ aic          : num 1754
 $ null.deviance: num 1783
 $ iter         : int 19
 $ df.residual  : int 49947
 $ df.null      : int 49999
 $ converged    : logi TRUE
 $ boundary     : logi FALSE
 $ call         : language glm(formula = fire ~ var3 + var1 + var12isNA + var4 + var11 + var13 + var6 + dummy + var9 + var16isNA + var10 + var17 + var8 + var7 + var15isNA +      var14isNA, family = binomial(), data = inData, model = FALSE)
 $ formula      :Class 'formula' length 3 fire ~ var3 + var1 + var12isNA + var4 + var11 + var13 + var6 + dummy + var9 + var16isNA + var10 + var17 + var8 + var7 + var15isNA + var14isNA
  .. ..- attr(*, ".Environment")=<environment: 0x7f98f5685ce8> 
 $ terms        :Classes 'terms', 'formula' length 3 fire ~ var3 + var1 + var12isNA + var4 + var11 + var13 + var6 + dummy + var9 + var16isNA + var10 + var17 + var8 + var7 + var15isNA + var14isNA
  .. ..- attr(*, "variables")= language list(fire, var3, var1, var12isNA, var4, var11, var13, var6, dummy, var9, var16isNA, var10, var17, var8, var7, var15isNA, var14isNA)
  .. ..- attr(*, "factors")= int [1:17, 1:16] 0 1 0 0 0 0 0 0 0 0 ...
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:17] "fire" "var3" "var1" "var12isNA" ...
  .. .. .. ..$ : chr [1:16] "var3" "var1" "var12isNA" "var4" ...
  .. ..- attr(*, "term.labels")= chr [1:16] "var3" "var1" "var12isNA" "var4" ...
  .. ..- attr(*, "order")= int [1:16] 1 1 1 1 1 1 1 1 1 1 ...
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: 0x7f98f5685ce8> 
  .. ..- attr(*, "predvars")= language list(fire, var3, var1, var12isNA, var4, var11, var13, var6, dummy, var9, var16isNA, var10, var17, var8, var7, var15isNA, var14isNA)
  .. ..- attr(*, "dataClasses")= Named chr [1:17] "numeric" "factor" "factor" "logical" ...
  .. .. ..- attr(*, "names")= chr [1:17] "fire" "var3" "var1" "var12isNA" ...
 $ offset       : NULL
 $ control      :List of 3
  ..$ epsilon: num 1e-08
  ..$ maxit  : num 25
  ..$ trace  : logi FALSE
 $ method       : chr "glm.fit"
 $ contrasts    :List of 12
  ..$ var3     : chr "contr.treatment"
  ..$ var1     : chr "contr.treatment"
  ..$ var12isNA: chr "contr.treatment"
  ..$ var4     : chr "contr.treatment"
  ..$ var6     : chr "contr.treatment"
  ..$ dummy    : chr "contr.treatment"
  ..$ var9     : chr "contr.treatment"
  ..$ var16isNA: chr "contr.treatment"
  ..$ var8     : chr "contr.treatment"
  ..$ var7     : chr "contr.treatment"
  ..$ var15isNA: chr "contr.treatment"
  ..$ var14isNA: chr "contr.treatment"
 $ xlevels      :List of 8
  ..$ var3 : chr [1:7] "1" "2" "3" "4" ...
  ..$ var1 : chr [1:6] "1" "2" "3" "4" ...
  ..$ var4 : chr [1:15] "99" "A1" "C1" "D1" ...
  ..$ var6 : chr [1:4] "A" "B" "C" "Z"
  ..$ dummy: chr [1:2] "A" "B"
  ..$ var9 : chr [1:3] "A" "B" "Z"
  ..$ var8 : chr [1:7] "1" "2" "3" "4" ...
  ..$ var7 : chr [1:9] "1" "2" "3" "4" ...
 - attr(*, "class")= chr [1:2] "glm" "lm"
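
(Side note on the dump above: both formula and terms carry an .Environment attribute, and save() serializes everything reachable through a non-global environment, which object.size() does not count. A quick check, assuming that fitting environment still exists, of whether the training data is hiding there:)

ls(attr(logitFire[[1]]$terms, ".Environment"))  # does inData (the 400K-row training set) appear here?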

Upvotes: 1

Views: 666

Answers (1)

smci

Reputation: 33970

You're describing ~135x bloat after save() to disk. Here are some tips, without seeing your data:

  1. object.size() is only a shallow measure of memory usage: it doesn't follow pointers, so anything reachable only by reference (notably environments, shared strings and factor labels) doesn't get counted. It can therefore be a severe undercount in some cases. Instead, use the excellent lsos() memory-reporting function and tell us what it reports.
  2. Does your environment contain lots of large strings (e.g. text or genomics data), high-cardinality factors whose labels got converted to strings, or unnecessary string row labels? Check that you set options(stringsAsFactors = FALSE) globally and pass stringsAsFactors = FALSE to read.csv(). Also make sure you never explicitly call data.frame(..., stringsAsFactors = TRUE). (But none of that seems to be the case from your str() dump.)
  3. Instead of just telling us object.size() or lsos() numbers, also tell us the total memory size of your R session both before and after creating those logistic-regression objects, and before and after running garbage collection (gc(); gc(reset = TRUE)). Then, after the save(), rm() the contents of logitFire and then logitFire itself, run garbage collection again, and report R's total memory usage and how much it shrank by (see the sketch after this list).
  4. Check what other variables/dataframes/data.tables/file objects/models are in your environment: run ls() and look at their sizes with lsos() (as @BenBolker hints).
  5. Test it under plain R, not RStudio! RStudio occasionally causes memory blowup or interferes with the gc (e.g. if you accidentally keep references to the data in a View() window).
  6. Check both your ./.Rprofile and ~/.Rprofile for wacky settings that may have crept in, even innocuous-looking ones like loading unnecessary packages, or loading them in a different order (which can shadow a builtin function). Have you reproduced this behavior from another clean environment/VM/coworker's login? If not, do.
  7. Make sure you start from a clean environment and don't import a .RData, especially one littered with lots of large temporaries. Use R --no-restore --no-save; in RStudio, also disable the General settings 'Restore RData' and 'Save RData'. Delete any .RData files you accidentally leave lying around.
  8. Some obvious, stupid things to check about the default behavior of save(): its defaults are ascii = FALSE, envir = parent.frame(), compress = isTRUE(!ascii), and compression_level (per the docs, "integer: the level of compression to be used. Defaults to 6 for gzip compression and to 9 for bzip2 or xz compression"). Check that nothing is messing with your defaults, and, as @BenBolker says, check what is in your environment (do ls() and look at the sizes; see the sketch after this list).
  9. Failing all that, you could always try a manual gzip on your .RData to see how bloated it is.
  10. Along with all of the above, try to cut your commands and parameters down to the minimum at which the issue is still reproducible, using randomly generated data with a fixed seed. Ideally we can get to a testcase :) Or a resolution.
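
A minimal sketch of the bookkeeping suggested in point 3 (trainModels() is a hypothetical placeholder for your training routine):

gc(reset = TRUE); gc()        # baseline after a full collection
logitFire <- trainModels()    # hypothetical: builds the list of 50 glms
gc()                          # how much did the session grow?
save(logitFire, file = 'logitFire.RData')
rm(logitFire)
gc()                          # how much is reclaimed after the rm()?

And for point 8, spelling out save()'s documented defaults explicitly, so it is obvious if anything in your session has overridden them:

save(logitFire, file = 'logitFire.RData',
     ascii = FALSE, compress = TRUE, compression_level = 6)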

Upvotes: 1
