Reputation: 6489
I've declined to produce sample code because so far I have been unable to replicate this example on a smaller data set. I am training several logistic regressions (50 in this example) using different covariate selections and saving the output as a list. My training data has more than 400K rows.
Recognizing that there is a large amount of unnecessary background data that gets stored in glm objects, my training script involves the following lines of code, which are intended to strip out as much extra data as I can and reduce the memory footprint of the output object:
fit[c('residuals', 'fitted.values', 'effects', 'weights', 'prior.weights', 'y', 'linear.predictors', 'data')] <- NULL
fit$qr$qr <- NULL
gc()
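For readers who want to try the stripping step on something small, here is a minimal sketch using the built-in mtcars data. The helper name strip_glm and the model formula are assumptions made for illustration; they are not from my training script. It only checks that predictions on new data still match after stripping.

```r
# Hypothetical helper (not from the original script): drop the
# per-observation components of a fitted glm, mirroring the lines above.
strip_glm <- function(fit) {
  drop <- c("residuals", "fitted.values", "effects", "weights",
            "prior.weights", "y", "linear.predictors", "data")
  fit[drop] <- NULL
  fit$qr$qr <- NULL   # the n x p part of the QR decomposition
  fit
}

# Toy model on a built-in dataset (assumption: any small binomial fit will do).
fit_full <- glm(am ~ wt + hp, family = binomial(), data = mtcars, model = FALSE)
fit_lean <- strip_glm(fit_full)

# Predictions on new data use the coefficients and terms, which are kept.
p_full <- predict(fit_full, newdata = mtcars, type = "response")
p_lean <- predict(fit_lean, newdata = mtcars, type = "response")

all.equal(p_full, p_lean)
c(full = object.size(fit_full), lean = object.size(fit_lean))
```

Note that methods which need the dropped pieces (for example summary() or standard errors from predict()) are not expected to work on the stripped object.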
At first this seemed like it worked OK. R/RStudio console tells me the list of glms is 9.6Mb after executing my code:
However, when I save this object using save(logitFire, file = 'logitFire.RData'), I find that its footprint is absolutely massive (1.32 GB on disk):
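One way to put the two numbers side by side is sketched below: it compares what object.size() reports with the length of the serialized object and with the size of the file written by save(). The list models is a toy stand-in for logitFire, so the figures it produces say nothing about my real data; the per-component breakdown at the end is only a rough guide, because anything shared between components is counted once per component.

```r
# Toy stand-in for logitFire (assumption): a short list of small models.
models <- lapply(1:3, function(i) {
  glm(am ~ wt + hp, family = binomial(), data = mtcars, model = FALSE)
})

in_memory  <- as.numeric(object.size(models))   # what object.size() reports
serialized <- length(serialize(models, NULL))   # roughly what save() writes before compression

path <- tempfile(fileext = ".RData")
save(models, file = path)
on_disk <- file.size(path)                      # compressed size on disk

c(in_memory = in_memory, serialized = serialized, on_disk = on_disk)

# Serialized size of each component of one model, largest first.
per_component <- sort(
  sapply(models[[1]], function(x) length(serialize(x, NULL))),
  decreasing = TRUE
)
head(per_component)
```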
Again, I 100% recognize that it is bad form not to supply a toy example. I tried using the iris dataset and was unable to reproduce the problem. It seems to be a feature of large data sets, but I'm not sure. Any experts out there have an idea of what's going on? My next step, if I can't solve this using the functions in the base package, will be to write my own wrappers for "leanLogit" and "leanPredict" functions that just strip out the model covariates and toss all the rest of the ancillary data.
EDIT: To clarify, the sample script that I included in this question is embedded within the model training routine that ultimately produces the list of glms, logitFire. This is not complete code; it is part of a much larger script. I included it so that readers can see which data objects I'm stripping out.
EDIT #2: Here's some additional requested info. To be as clear as I possibly can, logitFire is a list of 50 logistic regression models that I produced in R using glm. I've shown the output from the str command on one element of this list, i.e. one logistic regression model:
> object.size(logitFire)
10113640 bytes
> str(logitFire[[1]])
List of 21
$ coefficients : Named num [1:54] 18.361 -0.592 -1.043 -0.744 0.101 ...
..- attr(*, "names")= chr [1:54] "(Intercept)" "var32" "var33" "var34" ...
$ R : num [1:54, 1:54] -11.3 0 0 0 0 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:54] "(Intercept)" "var32" "var33" "var34" ...
.. ..$ : chr [1:54] "(Intercept)" "var32" "var33" "var34" ...
$ rank : int 53
$ qr :List of 4
..$ rank : int 53
..$ qraux: num [1:54] 1 1 1 1 1 ...
..$ pivot: int [1:54] 1 2 3 4 5 6 7 8 9 10 ...
..$ tol : num 1e-11
..- attr(*, "class")= chr "qr"
$ family :List of 12
..$ family : chr "binomial"
..$ link : chr "logit"
..$ linkfun :function (mu)
..$ linkinv :function (eta)
..$ variance :function (mu)
..$ dev.resids:function (y, mu, wt)
..$ aic :function (y, n, mu, wt, dev)
..$ mu.eta :function (eta)
..$ initialize: expression({ if (NCOL(y) == 1) { if (is.factor(y)) y <- y != levels(y)[1L] n <- rep.int(1, nobs) y[weights == 0] <- 0 if (any(y < 0 | y > 1)) stop("y values must be 0 <= y <= 1") mustart <- (weights * y + 0.5)/(weights + 1) m <- weights * y if (any(abs(m - round(m)) > 0.001)) warning("non-integer #successes in a binomial glm!") } else if (NCOL(y) == 2) { if (any(abs(y - round(y)) > 0.001)) warning("non-integer counts in a binomial glm!") n <- y[, 1] + y[, 2] y <- ifelse(n == 0, 0, y[, 1]/n) weights <- weights * n mustart <- (n * y + 0.5)/(n + 1) } else stop("for the 'binomial' family, y must be a vector of 0 and 1's\nor a 2 column matrix where col 1 is no. successes and col 2 is no. failures") })
..$ validmu :function (mu)
..$ valideta :function (eta)
..$ simulate :function (object, nsim)
..- attr(*, "class")= chr "family"
$ deviance : num 1648
$ aic : num 1754
$ null.deviance: num 1783
$ iter : int 19
$ df.residual : int 49947
$ df.null : int 49999
$ converged : logi TRUE
$ boundary : logi FALSE
$ call : language glm(formula = fire ~ var3 + var1 + var12isNA + var4 + var11 + var13 + var6 + dummy + var9 + var16isNA + var10 + var17 + var8 + var7 + var15isNA + var14isNA, family = binomial(), data = inData, model = FALSE)
$ formula :Class 'formula' length 3 fire ~ var3 + var1 + var12isNA + var4 + var11 + var13 + var6 + dummy + var9 + var16isNA + var10 + var17 + var8 + var7 + var15isNA + var14isNA
.. ..- attr(*, ".Environment")=<environment: 0x7f98f5685ce8>
$ terms :Classes 'terms', 'formula' length 3 fire ~ var3 + var1 + var12isNA + var4 + var11 + var13 + var6 + dummy + var9 + var16isNA + var10 + var17 + var8 + var7 + var15isNA + var14isNA
.. ..- attr(*, "variables")= language list(fire, var3, var1, var12isNA, var4, var11, var13, var6, dummy, var9, var16isNA, var10, var17, var8, var7, var15isNA, var14isNA)
.. ..- attr(*, "factors")= int [1:17, 1:16] 0 1 0 0 0 0 0 0 0 0 ...
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:17] "fire" "var3" "var1" "var12isNA" ...
.. .. .. ..$ : chr [1:16] "var3" "var1" "var12isNA" "var4" ...
.. ..- attr(*, "term.labels")= chr [1:16] "var3" "var1" "var12isNA" "var4" ...
.. ..- attr(*, "order")= int [1:16] 1 1 1 1 1 1 1 1 1 1 ...
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: 0x7f98f5685ce8>
.. ..- attr(*, "predvars")= language list(fire, var3, var1, var12isNA, var4, var11, var13, var6, dummy, var9, var16isNA, var10, var17, var8, var7, var15isNA, var14isNA)
.. ..- attr(*, "dataClasses")= Named chr [1:17] "numeric" "factor" "factor" "logical" ...
.. .. ..- attr(*, "names")= chr [1:17] "fire" "var3" "var1" "var12isNA" ...
$ offset : NULL
$ control :List of 3
..$ epsilon: num 1e-08
..$ maxit : num 25
..$ trace : logi FALSE
$ method : chr "glm.fit"
$ contrasts :List of 12
..$ var3 : chr "contr.treatment"
..$ var1 : chr "contr.treatment"
..$ var12isNA: chr "contr.treatment"
..$ var4 : chr "contr.treatment"
..$ var6 : chr "contr.treatment"
..$ dummy : chr "contr.treatment"
..$ var9 : chr "contr.treatment"
..$ var16isNA: chr "contr.treatment"
..$ var8 : chr "contr.treatment"
..$ var7 : chr "contr.treatment"
..$ var15isNA: chr "contr.treatment"
..$ var14isNA: chr "contr.treatment"
$ xlevels :List of 8
..$ var3 : chr [1:7] "1" "2" "3" "4" ...
..$ var1 : chr [1:6] "1" "2" "3" "4" ...
..$ var4 : chr [1:15] "99" "A1" "C1" "D1" ...
..$ var6 : chr [1:4] "A" "B" "C" "Z"
..$ dummy: chr [1:2] "A" "B"
..$ var9 : chr [1:3] "A" "B" "Z"
..$ var8 : chr [1:7] "1" "2" "3" "4" ...
..$ var7 : chr [1:9] "1" "2" "3" "4" ...
- attr(*, "class")= chr [1:2] "glm" "lm"
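The .Environment attributes on formula and terms in the output above point to one possible explanation, which nothing in this post confirms for the real data: object.size() does not count environments, while save() serializes any environment the object references (other than the global environment and namespaces). The sketch below reproduces that effect on purpose. The function make_fit, the object big, and the workaround of resetting the environments to globalenv() are all assumptions made for illustration.

```r
# Fit inside a function so the formula's environment is the function's
# frame, which also holds a large unrelated object (deliberate setup).
make_fit <- function(n_big) {
  big <- rnorm(n_big)   # large object living next to the formula
  d <- data.frame(y = rbinom(200, 1, 0.5), x = rnorm(200))
  glm(y ~ x, family = binomial(), data = d, model = FALSE)
}

set.seed(1)
fit <- make_fit(1e6)

# Objects reachable through the formula's environment.
captured <- ls(environment(fit$formula))

size_reported   <- as.numeric(object.size(fit))   # ignores environments
size_serialized <- length(serialize(fit, NULL))   # follows environments

# Possible workaround (assumption): point both environments at globalenv(),
# which is serialized as a reference rather than by content.
lean <- fit
environment(lean$formula) <- globalenv()
attr(lean$terms, ".Environment") <- globalenv()
size_lean <- length(serialize(lean, NULL))

captured
c(reported = size_reported, serialized = size_serialized, lean = size_lean)
```

Resetting the environment may affect code that looks up variables through the formula, so it is worth checking that predict() still behaves as expected on the lean object before relying on it.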
Reputation: 33970
You're saying 135x bloat after save() to disk. Here are some tips without seeing your data:
object.size is only a shallow measure of memory usage: it doesn't follow pointers, hence strings and factors don't get counted. So it can severely undercount in some cases. Instead, use the excellent lsos() memory-reporting function and tell us what it reports.
Set options('stringsAsFactors'=F), and set it in read.csv() too. Also make sure you never explicitly use data.frame(..., stringsAsFactors=T). (But none of that seems to be the case from your str() dump.)
As well as the object.size() or lsos() numbers, also tell us the total memory size of your R session both before and after creating those lr objects, and before and after running garbage collection (gc(), gc(reset=T)). Then, after the save(), rm() the contents of logitFire and then logitFire itself, report total memory usage by R, do garbage collection, then report total memory usage again and how much the size shrank by. [...] View() window).
Check ./.Rprofile and ~/.Rprofile for wack settings that may have crept in, even innocuous-looking ones like loading unnecessary packages, or loading them in a different order (which can shadow a builtin function). Did you repro this behavior from another clean environment / VM / coworker's login? If not, do: run R --no-restore --no-save, and in RStudio disable the General settings 'Restore RData' and 'Save RData'. Delete any .RData you accidentally leave lying around.
save(..., ascii = FALSE, envir = parent.frame(), compress = isTRUE(!ascii), compression_level), where compression_level is "integer: the level of compression to be used. Defaults to 6 for gzip compression and to 9 for bzip2 or xz compression". Check that nothing is messing with your defaults; also, as @BenBolker says, check what is in your environment (do ls() and look at the sizes of the objects).
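The before/after measurements and the compression check suggested above could be collected along these lines. This is only a sketch: mem_used_mb is a hypothetical helper, and x is toy data standing in for the real list of models.

```r
# Hypothetical helper: total memory currently used by R in Mb, taken from
# the "(Mb)" column next to "used" in gc()'s output (Ncells + Vcells).
mem_used_mb <- function() sum(gc()[, 2])

before <- mem_used_mb()
x <- rep(as.numeric(1:10), 1e5)   # about 8 MB of highly compressible toy data
after <- mem_used_mb()

# Save the same object with the default compression and with none.
f_gz  <- tempfile(fileext = ".RData")
f_raw <- tempfile(fileext = ".RData")
save(x, file = f_gz)                      # default: gzip compression
save(x, file = f_raw, compress = FALSE)   # no compression

c(mem_growth_mb = after - before,
  gzip_bytes    = file.size(f_gz),
  raw_bytes     = file.size(f_raw))
```

If the uncompressed file is far larger than what object.size() reports for the same object, the extra bytes are coming from something the object references rather than from compression settings.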