alexwhitworth

Reputation: 4907

Decreasing memory consumption in R -- pass by reference / data.table

I've already achieved a substantial speedup (~6.5x) by moving subsetting operations from base data.frame operations to data.table operations. But I'm wondering whether I can also reduce memory usage.

My understanding is that R does not natively pass by reference, so I'm looking for a way to do so short of rewriting a complex function in Rcpp. data.table provides some improvement (I edited the question to fix a typo caught by @Joshua Ulrich; see his answer below), but I'm looking for a larger improvement if possible.
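To make the copy-on-modify versus by-reference distinction concrete, here is a minimal sketch. It assumes only that data.table is installed; the helper names `scale_df` and `scale_dt` and the toy columns are hypothetical and not part of the code in this question.

```r
library(data.table)

# Toy inputs (hypothetical; far smaller than the 80 MB objects in the question).
df <- data.frame(id = 1:5, x = c(1, 2, 3, 4, 5))
dt <- as.data.table(df)

# Base R: assigning into `d` inside the function produces a modified copy;
# the caller's data.frame is left untouched.
scale_df <- function(d) {
  d$x <- d$x * 2
  d
}

# data.table: `:=` updates the table by reference, so the caller's table
# changes and the table as a whole is not copied.
scale_dt <- function(d) {
  d[, x := x * 2]
  invisible(d)
}

addr_before <- address(dt)
df_out <- scale_df(df)
scale_dt(dt)

df_unchanged <- identical(df$x, c(1, 2, 3, 4, 5))    # caller's data.frame intact
dt_changed   <- identical(dt$x, c(2, 4, 6, 8, 10))   # caller's data.table modified
same_address <- identical(address(dt), addr_before)  # still the same object
```

The flip side of this design is that a data.table passed to a function can be changed behind the caller's back, so `copy()` is needed whenever the original must be preserved.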

In my actual use case, I'm simulating numerous datasets in parallel, with optimization via simulated annealing. I'd rather not rewrite both the simulated annealing and my loss function calculations in Rcpp because of the increased dev time and technical debt.

Example of problem:

What I'm largely concerned with is removing some subset of observations from a dataset and adding another subset of observations. A very simple (nonsensical) example is given here. Is there a way to decrease memory usage? My current approach appears to pass by value, so memory usage (RAM) is roughly doubled.

library(data.table)
set.seed(444L)

df1 <- data.frame(matrix(rnorm(1e7), ncol= 10))
df2 <- data.table(matrix(rnorm(1e7), ncol= 10))

prof_func <- function(df) {
  # s1: rows to drop; s2: rows to append (sampled independently)
  s1 <- sample(1:nrow(df), size= 500, replace= FALSE)
  s2 <- sample(1:nrow(df), size= 500, replace= FALSE)
  return(rbind(df[-s1,], df[s2,]))
}

dt_m <- df_m <- vector("numeric", length= 500L)

for (i in 1:500) {

  Rprof("./DF_mem.out", memory.profiling = TRUE)
  y <- prof_func(df1)
  Rprof(NULL)
  df <- summaryRprof("./DF_mem.out", memory= "both")
  df_m[i] <- df$by.self$mem.total[which(rownames(df$by.self) == "\"rbind\"")]


  Rprof("./DT_mem.out", memory.profiling = TRUE)
  y2 <- prof_func(df2)
  Rprof(NULL)
  dt <- summaryRprof("./DT_mem.out", memory = "both")
  dt_m[i] <- dt$by.self$mem.total[which(rownames(dt$by.self) == "\"rbind\"")]

}
pryr::object_size(df1)
# 80 MB
pryr::object_size(df2)
# 80 MB

# EDITED: via typo / fix from @Joshua Ulrich.
# improvement in memory usage via DT. still not pass-by-reference
quantile(df_m, seq(0,1,.1))
#     0%    10%    20%    30%    40%    50%    60%    70%    80%    90%   100% 
# 379.00 428.60 440.10 447.70 455.36 459.20 466.48 469.89 474.40 482.10 512.60 
quantile(dt_m, seq(0,1,.1))
#     0%    10%    20%    30%    40%    50%    60%    70%    80%    90%   100% 
#  76.80  84.50  84.50  92.10  92.10  92.10  92.10 107.30 116.46 130.20 157.00 
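If the number of rows removed always equals the number of rows added, as in prof_func, one possible way to avoid building a second full-size table is to overwrite the dropped rows in place with data.table's `set()`. The sketch below rests on several assumptions and is not the code profiled above: `swap_rows_inplace` is a hypothetical helper, the table is shrunk so the example runs quickly, and the row order differs from the rbind version (new rows land in the dropped rows' positions rather than at the end).

```r
library(data.table)
set.seed(444L)

# Smaller stand-in for df2 (assumption: 1,000 rows instead of 1e6).
dt <- data.table(matrix(rnorm(1e4), ncol = 10))

# Hypothetical in-place variant of prof_func: same multiset of rows as
# rbind(d[-s1, ], d[s2, ]), but written into the existing table.
swap_rows_inplace <- function(d, n) {
  s1 <- sample(nrow(d), size = n, replace = FALSE)  # rows to drop
  s2 <- sample(nrow(d), size = n, replace = FALSE)  # rows to add
  new_rows <- d[s2]  # small n-row copy, taken before anything is overwritten
  for (j in seq_along(d)) {
    # Overwrite the dropped rows of column j by reference.
    set(d, i = s1, j = j, value = new_rows[[j]])
  }
  invisible(d)
}

addr_before <- address(dt)
swap_rows_inplace(dt, 50L)
same_object <- identical(address(dt), addr_before)
```

Only the n-row temporary `new_rows` is allocated, rather than a second full table. Whether this is acceptable depends on whether the downstream loss function cares about row order, and on whether the caller still needs the unmodified table.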

Appendix:

### speed improvement:
#-----------------------------------------------
library(data.table)
library(microbenchmark)

set.seed(444L)

df1 <- data.frame(matrix(rnorm(1e7), ncol= 10))
df2 <- data.table(matrix(rnorm(1e7), ncol= 10))

microbenchmark(
  df= {
    s1 <- sample(1:nrow(df1), size= 500, replace= FALSE)
    s2 <- sample(1:nrow(df1), size= 500, replace= FALSE)
    df1 <- rbind(df1[-s1,], df1[s2,])
  },
  dt= {
    s1 <- sample(1:nrow(df2), size= 500, replace= FALSE)
    s2 <- sample(1:nrow(df2), size= 500, replace= FALSE)
    df2 <- rbind(df2[-s1,], df2[s2,])
  }, times= 100L)

Unit: milliseconds
 expr      min        lq     mean   median       uq      max neval cld
   df 672.5106 757.65188 814.1582 809.6346 864.6668 998.2290   100   b
   dt  68.1254  85.73178 139.1256 120.3613 148.8243 397.7359   100  a 

Upvotes: 3

Views: 196

Answers (1)

Joshua Ulrich

Reputation: 176648

prof_func has an error. It calls rbind on df1 instead of its argument (df). Fix that, and you will see reduced memory usage with the data.table object.

library(data.table)
set.seed(444L)

df1 <- data.frame(matrix(rnorm(1e7), ncol= 10))
df2 <- data.table(matrix(rnorm(1e7), ncol= 10))

prof_func <- function(df) {
  # uses the argument `df` throughout, not the global df1
  s1 <- sample(1:nrow(df), size= 500, replace= FALSE)
  s2 <- sample(1:nrow(df), size= 500, replace= FALSE)
  return(rbind(df[-s1,], df[s2,]))
}

dt_m <- df_m <- vector("numeric", length= 500L)

for (i in 1:100) {
  Rprof("./DF_mem.out", memory.profiling = TRUE, interval=0.01)
  y <- prof_func(df1)
  Rprof(NULL)
  df <- summaryRprof("./DF_mem.out", memory= "both")
  df_m[i] <- df$by.total["\"rbind\"","mem.total"]

  Rprof("./DT_mem.out", memory.profiling = TRUE, interval=0.01)
  y2 <- prof_func(df2)
  Rprof(NULL)
  dt <- summaryRprof("./DT_mem.out", memory = "both")
  dt_m[i] <- dt$by.total["\"rbind\"","mem.total"]
}
quantile(df_m, seq(0,1,.1))
#    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
#   0.0   0.0   0.0   0.0   0.0   0.0   0.0 413.4 432.5 455.0 485.9 
quantile(dt_m, seq(0,1,.1))
#    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
#   0.0   0.0   0.0   0.0   0.0   0.0   0.0  53.9  84.5 122.6 153.1 
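The error described at the top of this answer is easy to miss, because R silently resolves a name that is not a local variable by looking in the enclosing (here, global) environment. A minimal sketch of the failure mode, using hypothetical toy objects and function names rather than the original prof_func:

```r
# Two globals of different sizes (hypothetical toy data).
df1 <- data.frame(x = 1:3)
df2 <- data.frame(x = 4:9)

# Buggy: the body refers to the global `df1`, so the argument is ignored
# and every call operates on df1 regardless of what is passed in.
buggy_nrow <- function(df) nrow(rbind(df1, df1))

# Fixed: the body uses its argument.
fixed_nrow <- function(df) nrow(rbind(df, df))

buggy_result <- buggy_nrow(df2)  # 6: computed from df1 (3 rows, twice)
fixed_result <- fixed_nrow(df2)  # 12: computed from df2 (6 rows, twice)
```

In a profiling loop this kind of bug makes both timings measure the same object, which is why the data.frame and data.table memory figures looked alike before the fix.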

Upvotes: 5
