I've already achieved a substantial speedup (~6.5x) by moving subsetting operations from base data.frame operations to data.table operations. But I'm wondering if I can get any improvement in memory usage.
My understanding is that R does not natively pass-by-reference (e.g. see here). So, I'm seeking a method (short of re-writing a complex function in Rcpp) to do so. data.table provides some improvement [after editing my question to fix the typo caught by @Joshua Ulrich below]. But I'm looking for a larger improvement if possible.
In my actual use case, I'm running simulations in parallel over numerous datasets, with optimization via simulated annealing. I'd rather not re-write both the simulated annealing and my loss function calculations in Rcpp, due to the increased dev time and increased technical debt.
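To make the copy-versus-reference distinction above concrete, here is a minimal sketch. The helper names bump_df and bump_dt are hypothetical and exist only for illustration. It assumes nothing beyond data.table's set(): a function that modifies a data.frame argument changes only a local copy, whereas set() updates the caller's data.table in place.

```r
library(data.table)

# Hypothetical helpers, for illustration only.
bump_df <- function(d) {
  d$V1[1] <- 0                         # modifies a local copy of the data.frame
  d
}
bump_dt <- function(d) {
  set(d, i = 1L, j = "V1", value = 0)  # data.table::set() writes by reference
  invisible(d)
}

df <- data.frame(V1 = c(1, 2, 3))
dt <- data.table(V1 = c(1, 2, 3))

res <- bump_df(df)   # the modified copy is only available via the return value
bump_dt(dt)          # no return value needed; dt itself has changed

df$V1[1]   # the caller's data.frame is unchanged
dt$V1[1]   # the caller's data.table was modified in place
```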
What I'm largely concerned with is removing some subset of observations from a dataset and adding in another subset of observations. A very simple (nonsensical) example is given here. Is there a way to decrease memory usage? My current usage appears to pass-by-value and therefore memory usage (RAM) is roughly doubled.
library(data.table)
set.seed(444L)
df1 <- data.frame(matrix(rnorm(1e7), ncol= 10))
df2 <- data.table(matrix(rnorm(1e7), ncol= 10))
prof_func <- function(df) {
  s1 <- sample(1:nrow(df), size= 500, replace= FALSE)
  s2 <- sample(1:nrow(df), size= 500, replace= FALSE)
  return(rbind(df[-s1,], df[s2,]))
}
dt_m <- df_m <- vector("numeric", length= 500L)
for (i in 1:500) {
  Rprof("./DF_mem.out", memory.profiling = TRUE)
  y <- prof_func(df1)
  Rprof(NULL)
  df <- summaryRprof("./DF_mem.out", memory= "both")
  df_m[i] <- df$by.self$mem.total[which(rownames(df$by.self) == "\"rbind\"")]
  Rprof("./DT_mem.out", memory.profiling = TRUE)
  y2 <- prof_func(df2)
  Rprof(NULL)
  dt <- summaryRprof("./DT_mem.out", memory = "both")
  dt_m[i] <- dt$by.self$mem.total[which(rownames(dt$by.self) == "\"rbind\"")]
}
pryr::object_size(df1)
# 80 MB
pryr::object_size(df2)
# 80 MB
# EDITED: via typo / fix from @Joshua Ulrich.
# improvement in memory usage via DT. still not pass-by-reference
quantile(df_m, seq(0,1,.1))
#     0%    10%    20%    30%    40%    50%    60%    70%    80%    90%   100%
# 379.00 428.60 440.10 447.70 455.36 459.20 466.48 469.89 474.40 482.10 512.60
quantile(dt_m, seq(0,1,.1))
#     0%    10%    20%    30%    40%    50%    60%    70%    80%    90%   100%
#  76.80  84.50  84.50  92.10  92.10  92.10  92.10 107.30 116.46 130.20 157.00
### speed improvement:
#-----------------------------------------------
library(data.table)
library(microbenchmark)
set.seed(444L)
df1 <- data.frame(matrix(rnorm(1e7), ncol= 10))
df2 <- data.table(matrix(rnorm(1e7), ncol= 10))
microbenchmark(
  df= {
    s1 <- sample(1:nrow(df1), size= 500, replace= FALSE)
    s2 <- sample(1:nrow(df1), size= 500, replace= FALSE)
    df1 <- rbind(df1[-s1,], df1[s2,])
  },
  dt= {
    s1 <- sample(1:nrow(df2), size= 500, replace= FALSE)
    s2 <- sample(1:nrow(df2), size= 500, replace= FALSE)
    df2 <- rbind(df2[-s1,], df2[s2,])
  }, times= 100L)
# Unit: milliseconds
#  expr      min        lq     mean   median       uq      max neval cld
#    df 672.5106 757.65188 814.1582 809.6346 864.6668 998.2290   100   b
#    dt  68.1254  85.73178 139.1256 120.3613 148.8243 397.7359   100   a
Upvotes: 3
Views: 196
Reputation: 176648
prof_func has an error. It calls rbind on df1 instead of its argument (df). Fix that, and you will see reduced memory usage with the data.table object.
library(data.table)
set.seed(444L)
df1 <- data.frame(matrix(rnorm(1e7), ncol= 10))
df2 <- data.table(matrix(rnorm(1e7), ncol= 10))
prof_func <- function(df) {
  s1 <- sample(1:nrow(df), size= 500, replace= FALSE)
  s2 <- sample(1:nrow(df), size= 500, replace= FALSE)
  return(rbind(df[-s1,], df[s2,]))
}
dt_m <- df_m <- vector("numeric", length= 500L)
for (i in 1:100) {
  Rprof("./DF_mem.out", memory.profiling = TRUE, interval=0.01)
  y <- prof_func(df1)
  Rprof(NULL)
  df <- summaryRprof("./DF_mem.out", memory= "both")
  df_m[i] <- df$by.total["\"rbind\"","mem.total"]
  Rprof("./DT_mem.out", memory.profiling = TRUE, interval=0.01)
  y2 <- prof_func(df2)
  Rprof(NULL)
  dt <- summaryRprof("./DT_mem.out", memory = "both")
  dt_m[i] <- dt$by.total["\"rbind\"","mem.total"]
}
quantile(df_m, seq(0,1,.1))
#    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100%
#   0.0   0.0   0.0   0.0   0.0   0.0   0.0 413.4 432.5 455.0 485.9
quantile(dt_m, seq(0,1,.1))
#    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100%
#   0.0   0.0   0.0   0.0   0.0   0.0   0.0  53.9  84.5 122.6 153.1
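Going one step further than the rbind fix, the following is a sketch of a by-reference alternative, not something profiled against the numbers above. It assumes that the number of rows removed always equals the number added (500 each in the question) and that row order does not matter. Under those assumptions the "removed" rows can simply be overwritten with copies of the "added" rows using data.table's set(), so no new full-size table has to be built. The helper name swap_rows_inplace and the smaller table size are hypothetical choices for illustration.

```r
library(data.table)

set.seed(444L)
dt <- data.table(matrix(rnorm(1e5), ncol = 10))   # 10,000 rows, columns V1..V10

# Hypothetical helper: overwrite the rows in `drop` with copies of the rows
# in `add`, modifying `dt` in place instead of building a new table.
swap_rows_inplace <- function(dt, drop, add) {
  stopifnot(length(drop) == length(add))
  for (col in names(dt)) {
    # dt[[col]][add] is evaluated first (a short temporary vector),
    # then written into the existing column by reference.
    set(dt, i = drop, j = col, value = dt[[col]][add])
  }
  invisible(dt)
}

s1 <- sample(nrow(dt), size = 500)   # rows to remove
s2 <- sample(nrow(dt), size = 500)   # rows to add back in

addr_before <- address(dt)           # memory address of the table
expected    <- dt[s2]                # copy of the rows that should replace s1

swap_rows_inplace(dt, s1, s2)

identical(address(dt), addr_before)  # same object, no new table allocated
all.equal(dt[s1], expected)          # rows s1 now hold the old contents of s2
```

Unlike rbind(df[-s1,], df[s2,]), this leaves the new rows at the positions in s1 rather than appending them at the end, so it only fits if the downstream loss function does not depend on row order.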
Upvotes: 5