Reputation: 285
I'd like to rbind multiple data.tables in a memory-efficient way.
More precisely, I'd like to rbind them one by one and free memory as I go, so that I can bind n data.tables of size k when my memory is only of size (n+1)*k.
I wrote this function hoping to do that:
rbindlistOneByOne <- function(l, use.names = FALSE, fill = FALSE, idcol = NULL, verbose = FALSE) {
    ll <- length(l)
    # Handle empty and single-element lists
    if (ll <= 0) stop("rbindlistOneByOne: empty list")
    if (ll <= 1) return(l[[1]])
    # Handle normal lists (ll >= 2)
    current <- l[[1]]
    res <- current
    l[1] <- NULL
    rm(current); gc()
    for (i in 2:ll) {
        current <- l[[1]]
        res <- rbindlist(list(res, current), use.names = use.names, fill = fill, idcol = idcol)
        l[1] <- NULL
        rm(current); gc()
    }
    return(res)
}
Now the problem is that this function is not memory efficient, even though I thought it would be.
Do you know why? Is it because rm does not free memory, and the data.table called "current" remains in memory?
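For reference, here is a minimal way to exercise the function above (the list of 1000 small tables is a made-up example, not from the original post):

```r
library(data.table)

# 1000 hypothetical tables of 10 rows each
l <- lapply(1:1000, function(i) data.table(x = 1:10, y = rnorm(10)))

res <- rbindlistOneByOne(l)
dim(res)  # 10000 rows, 2 columns
```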
Upvotes: 3
Views: 1100
Reputation: 3223
There is no way to do what you want to do. Memory release in R is stochastic; you can't control it. Calling gc()
may or may not release memory, and it is not under the user's control.
From http://adv-r.had.co.nz/memory.html:
Despite what you might have read elsewhere, there’s never any need to call gc() yourself. R will automatically run garbage collection whenever it needs more space; if you want to see when that is, call gcinfo(TRUE). The only reason you might want to call gc() is to ask R to return memory to the operating system. However, even that might not have any effect: older versions of Windows had no way for a program to return memory to the OS.
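You can observe this behaviour yourself with gcinfo(), as the quote suggests; this is a small illustrative sketch, not part of the original answer:

```r
gcinfo(TRUE)    # print a message every time R runs garbage collection

x <- replicate(100, rnorm(1e5), simplify = FALSE)
rm(x)           # removes the binding only; memory is reclaimed at the NEXT gc

invisible(gc()) # explicit request; gc() returns current Ncells/Vcells usage
gcinfo(FALSE)   # turn the messages back off
```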
In addition, calling gc
is extremely slow. Here is a benchmark of your function with and without calling gc,
for a list of 1000 tables of 10 lines each:

without gc: 8 ms
with gc: 7 s

rbindlist
is the most efficient way to bind data.tables.
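In other words, just pass the whole list to rbindlist in a single call; it can compute the final size up front and allocate the result once, instead of reallocating at every step. A minimal comparison sketch (the example list is assumed, and rbindlistOneByOne refers to the function from the question):

```r
library(data.table)

l <- lapply(1:1000, function(i) data.table(x = 1:10))

# Single call: one allocation for the full result
system.time(res1 <- rbindlist(l))

# One-by-one with gc(): each iteration copies the growing result
# and pays the cost of a full garbage collection
system.time(res2 <- rbindlistOneByOne(l))

identical(res1, res2)
```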
Upvotes: 1