Decreasing memory consumption in R -- pass by reference / data.table

Question

I've already achieved a substantial speedup (~6.5x) by moving subsetting operations from base data.frame operations to data.table operations. But I'm wondering whether I can get a similar improvement in memory usage.

My understanding is that R does not natively pass by reference (e.g., see here). So I'm looking for a way to do so, short of rewriting a complex function in Rcpp. data.table provides some improvement [after editing my question to fix the typo caught by @Joshua Ulrich below], but I'm looking for a larger improvement if possible.

Another option is possibly the R.oo package, though I haven't yet found a good tutorial. (I still need to read this.)
Would reference classes help at all?

In my actual use case, I'm simulating numerous datasets in parallel, with optimization via simulated annealing. I'd rather not rewrite both the simulated annealing and my loss function calculations in Rcpp, because of the increased dev time and technical debt.

Example of problem:

What I'm largely concerned with is removing some subset of observations from a dataset and adding in another subset of observations. A very simple (nonsensical) example is given here. Is there a way to decrease memory usage? My current code appears to pass by value, so memory usage (RAM) is roughly doubled.

library(data.table)
set.seed(444L)

df1 <- data.frame(matrix(rnorm(1e7), ncol= 10))
df2 <- data.table(matrix(rnorm(1e7), ncol= 10))

prof_func <- function(df) {
  s1 <- sample(1:nrow(df), size= 500, replace=F)
  s2 <- sample(1:nrow(df), size= 500, replace=F)
  return(rbind(df[-s1,], df[s2,]))
}

dt_m <- df_m <- vector("numeric", length= 500L)

for (i in 1:500) {

  Rprof("./DF_mem.out", memory.profiling = TRUE)
  y <- prof_func(df1)
  Rprof(NULL)
  df <- summaryRprof("./DF_mem.out", memory= "both")
  df_m[i] <- df$by.self$mem.total[which(rownames(df$by.self) == "\"rbind\"")]


  Rprof("./DT_mem.out", memory.profiling = TRUE)
  y2 <- prof_func(df2)
  Rprof(NULL)
  dt <- summaryRprof("./DT_mem.out", memory = "both")
  dt_m[i] <- dt$by.self$mem.total[which(rownames(dt$by.self) == "\"rbind\"")]

}
pryr::object_size(df1)
# 80 MB
pryr::object_size(df2)
# 80 MB

# EDITED: via typo / fix from @Joshua Ulrich.
# improvement in memory usage via DT. still not pass-by-reference
quantile(df_m, seq(0,1,.1))
#     0%    10%    20%    30%    40%    50%    60%    70%    80%    90%   100% 
# 379.00 428.60 440.10 447.70 455.36 459.20 466.48 469.89 474.40 482.10 512.60 
quantile(dt_m, seq(0,1,.1))
#     0%    10%    20%    30%    40%    50%    60%    70%    80%    90%   100% 
#  76.80  84.50  84.50  92.10  92.10  92.10  92.10 107.30 116.46 130.20 157.00

Appendix:

### speed improvement:
#-----------------------------------------------
library(data.table)
library(microbenchmark)

set.seed(444L)

df1 <- data.frame(matrix(rnorm(1e7), ncol= 10))
df2 <- data.table(matrix(rnorm(1e7), ncol= 10))

microbenchmark(
  df= {
    s1 <- sample(1:nrow(df1), size= 500, replace=F)
    s2 <- sample(1:nrow(df1), size= 500, replace=F)
    df1 <- rbind(df1[-s1,], df1[s2,])
  },
  dt= {
    s1 <- sample(1:nrow(df2), size= 500, replace=F)
    s2 <- sample(1:nrow(df2), size= 500, replace=F)
    df2 <- rbind(df2[-s1,], df2[s2,])

  }, times= 100L)

Unit: milliseconds
 expr      min        lq     mean   median       uq      max neval cld
   df 672.5106 757.65188 814.1582 809.6346 864.6668 998.2290   100   b
   dt  68.1254  85.73178 139.1256 120.3613 148.8243 397.7359   100  a

Joshua Ulrich · Accepted Answer

prof_func has an error. It calls rbind on df1 instead of its argument (df). Fix that, and you will see reduced memory usage with the data.table object.

library(data.table)
set.seed(444L)

df1 <- data.frame(matrix(rnorm(1e7), ncol= 10))
df2 <- data.table(matrix(rnorm(1e7), ncol= 10))

prof_func <- function(df) {
  s1 <- sample(1:nrow(df), size= 500, replace=F)
  s2 <- sample(1:nrow(df), size= 500, replace=F)
  return(rbind(df[-s1,], df[s2,]))
}

dt_m <- df_m <- vector("numeric", length= 500L)

for (i in 1:100) {
  Rprof("./DF_mem.out", memory.profiling = TRUE, interval=0.01)
  y <- prof_func(df1)
  Rprof(NULL)
  df <- summaryRprof("./DF_mem.out", memory= "both")
  df_m[i] <- df$by.total["\"rbind\"","mem.total"]

  Rprof("./DT_mem.out", memory.profiling = TRUE, interval=0.01)
  y2 <- prof_func(df2)
  Rprof(NULL)
  dt <- summaryRprof("./DT_mem.out", memory = "both")
  dt_m[i] <- dt$by.total["\"rbind\"","mem.total"]
}
quantile(df_m, seq(0,1,.1))
#    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
#   0.0   0.0   0.0   0.0   0.0   0.0   0.0 413.4 432.5 455.0 485.9 
quantile(dt_m, seq(0,1,.1))
#    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
#   0.0   0.0   0.0   0.0   0.0   0.0   0.0  53.9  84.5 122.6 153.1
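
If the number of rows removed always equals the number added, as in prof_func, one way to avoid building a new table at all might be to overwrite the dropped rows in place with data.table's set(). The following is only a sketch under that assumption, not something measured in this answer: replace_rows_inplace is a hypothetical helper name, the resulting row order differs from rbind(df[-s1,], df[s2,]), and whether it actually lowers peak memory in your simulation would need to be profiled.

```r
library(data.table)
set.seed(444L)

# Smaller table than in the question, just to keep the sketch quick to run.
dt <- data.table(matrix(rnorm(1e5), ncol = 10))

# Hypothetical helper (not part of data.table): overwrite the rows in `drop`
# with copies of the rows in `add`, modifying `dt` by reference.
# Assumes length(drop) == length(add) and that `drop` has no duplicates.
replace_rows_inplace <- function(dt, drop, add) {
  stopifnot(length(drop) == length(add), !anyDuplicated(drop))
  # Copy the incoming rows first, because `drop` and `add` may overlap.
  # This copies only length(add) rows, not the whole table.
  new_rows <- dt[add]
  for (j in seq_along(dt)) {
    set(dt, i = drop, j = j, value = new_rows[[j]])
  }
  invisible(dt)
}

s1 <- sample(nrow(dt), size = 500)
s2 <- sample(nrow(dt), size = 500)

addr_before <- address(dt)   # memory address of the table before the update
n_before    <- nrow(dt)
expected    <- dt[s2]        # the rows that should end up in positions s1

replace_rows_inplace(dt, s1, s2)

# Same object, same number of rows, and rows s1 now hold the old rows s2.
identical(address(dt), addr_before)
nrow(dt) == n_before
identical(as.matrix(dt[s1]), as.matrix(expected))
```

Because the function modifies its argument, the caller's table changes as a side effect. If the original is still needed, take an explicit copy(dt) first, which of course brings back the memory cost.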
