TheComeOnMan

Reputation: 12875

Memory Management When Reading Multiple Files

There are lots of questions on reading multiple files, and memory management. I'm looking for information that addresses both of these issues together.

I often have to read multiple parts of the data as separate files, rbind them into one dataset, and then process it. Until now I've been using something like:

    rbindeddataset <- do.call("rbind", lapply(list.files(), read.csv, header = TRUE))

I'm concerned about the bump in memory usage that one can observe with each of these approaches. That is probably the point at which both the rbindeddataset and the not-yet-rbinded datasets exist in memory together, but I don't know enough to be sure. Can someone confirm this?

Is there some way I can extend the principle of pre-allocation to such a task? Or some other trick that anyone knows that might help in avoiding that bump? I also tried rbindlist over the result of lapply and that doesn't show the bump. Does that mean rbindlist is smart enough to handle this?

data.table and Base R solutions preferred over some package's offerings.
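To make the question concrete, here is the kind of base-R pre-allocation I have in mind (untested sketch; it assumes all the CSVs share the same columns, and note the pieces still coexist in memory with the result, so it only avoids the repeated copying done by do.call(rbind)):

    filenames <- list.files(pattern = "\\.csv$")
    pieces <- lapply(filenames, read.csv, header = TRUE)
    n <- sum(vapply(pieces, nrow, integer(1)))

    ## allocate the full result once, using the first piece as a template
    ## (indexing a data frame with NA produces all-NA rows)
    out <- pieces[[1]][rep.int(NA_integer_, n), , drop = FALSE]

    ## fill the pre-allocated result in place, one piece at a time
    row <- 1L
    for (p in pieces) {
      out[row:(row + nrow(p) - 1L), ] <- p
      row <- row + nrow(p)
    }
    rownames(out) <- NULL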

EDIT ON 07-OCT-2013 based on the discussion with @Dwin and @mrip

> library(data.table)
> filenames <- list.files()
> 
> #APPROACH 1 #################################
> starttime <- proc.time()
> test <- do.call("rbind", lapply(filenames, read.csv, header = TRUE))
> proc.time() - starttime
   user  system elapsed 
  44.60    1.11   45.98 
> 
> rm(test)
> rm(starttime)
> gc()
          used (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells  350556 18.8     741108   39.6    715234   38.2
Vcells 1943837 14.9  153442940 1170.7 192055310 1465.3
> 
> #APPROACH 2 #################################
> starttime <- proc.time()
> test <- lapply(filenames, read.csv, header = TRUE)
> test2 <- do.call("rbind", test)
> proc.time() - starttime
   user  system elapsed 
  47.09    1.26   50.70 
> 
> rm(test)
> rm(test2)
> rm(starttime)
> gc()
          used (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells  350559 18.8     741108   39.6    715234   38.2
Vcells 1943849 14.9  157022756 1198.0 192055310 1465.3
> 
> 
> #APPROACH 3 #################################
> starttime <- proc.time()
> test <- lapply(filenames, read.csv, header = TRUE)
> test <- do.call("rbind", test)
> proc.time() - starttime
   user  system elapsed 
  48.61    1.93   51.16 
> rm(test)
> rm(starttime)
> gc()
          used (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells  350562 18.8     741108   39.6    715234   38.2
Vcells 1943861 14.9  152965559 1167.1 192055310 1465.3
> 
> 
> #APPROACH 4 #################################
> starttime <- proc.time()
> test <- do.call("rbind", lapply(filenames, fread))

> proc.time() - starttime
   user  system elapsed 
  12.87    0.09   12.95 
> rm(test)
> rm(starttime)
> gc()
          used (Mb) gc trigger  (Mb)  max used   (Mb)
Ncells  351067 18.8     741108  39.6    715234   38.2
Vcells 1964791 15.0  122372447 933.7 192055310 1465.3
> 
> 
> #APPROACH 5 #################################
> starttime <- proc.time()
> test <- do.call("rbind", lapply(filenames, read.csv, header = TRUE))
> proc.time() - starttime
   user  system elapsed 
  51.12    1.62   54.16 
> rm(test)
> rm(starttime)
> gc()
          used (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells  350568 18.8     741108   39.6    715234   38.2
Vcells 1943885 14.9  160270439 1222.8 192055310 1465.3
> 
> 
> #APPROACH 6 #################################
> starttime <- proc.time()
> test <- rbindlist(lapply(filenames, fread ))

> proc.time() - starttime
   user  system elapsed 
  13.62    0.06   14.60 
> rm(test)
> rm(starttime)
> gc()
          used (Mb) gc trigger  (Mb)  max used   (Mb)
Ncells  351078 18.8     741108  39.6    715234   38.2
Vcells 1956397 15.0  128216351 978.3 192055310 1465.3
> 
> 
> #APPROACH 7 #################################
> starttime <- proc.time()
> test <- rbindlist(lapply(filenames, read.csv, header = TRUE))
> proc.time() - starttime
   user  system elapsed 
  48.44    0.83   51.70 
> rm(test)
> rm(starttime)
> gc()
          used (Mb) gc trigger  (Mb)  max used   (Mb)
Ncells  350620 18.8     741108  39.6    715234   38.2
Vcells 1944204 14.9  102573080 782.6 192055310 1465.3

As expected, the time savings are highest with fread. However, approaches 4, 6, and 7 show minimal memory overhead, and I'm not entirely sure why.

(plot of memory usage over time for each approach, showing the bump)

Upvotes: 2

Views: 259

Answers (2)

IRTFM

Reputation: 263342

Try this:

require(data.table)
system.time({
test3 <- do.call("rbind", lapply(filenames, fread, header = TRUE))
            })

You mentioned pre-allocation. fread does have an 'nrows' argument, but supplying it does not speed things up even when you know the number of rows in advance, because fread automatically counts the number of rows itself up front, which is very quick.
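As a quick sanity check (a sketch, assuming the same CSVs from the question are in the working directory), you can read one file with and without nrows and confirm the results match; timing the two calls should show no meaningful difference:

    library(data.table)
    f <- list.files(pattern = "\\.csv$")[1]
    a <- fread(f)                    # fread counts the rows itself, then allocates
    b <- fread(f, nrows = nrow(a))   # same read with nrows supplied explicitly
    identical(a, b)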

Upvotes: 2

mrip

Reputation: 15163

It looks like rbindlist preallocates the memory and constructs the new data frame in one pass, whereas do.call(rbind) appends one data frame at a time, copying the accumulated result each time. The result is that the rbind method has a running time of O(n^2) whereas rbindlist runs in linear time. rbindlist should also avoid the bump in memory, since it doesn't have to allocate a new data frame during each of the n iterations.

Some experimental data:

## build 10000 copies of a small data frame to bind together
x <- data.frame(matrix(1:10000, 1000, 10))
ls <- list()
for (i in 1:10000)
  ls[[i]] <- x + i

## elapsed time to bind the first i pieces with do.call(rbind)
rbindtime <- function(i) {
  gc()
  system.time(do.call(rbind, ls[1:i]))[3]
}
## elapsed time to bind the first i pieces with rbindlist
rbindlisttime <- function(i) {
  gc()
  system.time(data.frame(rbindlist(ls[1:i])))[3]
}

ii<-unique(floor(10*1.5^(1:15)))
## [1]   15   22   33   50   75  113  170  256  384  576  864 1297 1946 2919 4378

times<-Vectorize(rbindtime)(ii)
##elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed 
##  0.009   0.014   0.026   0.049   0.111   0.209   0.350   0.638   1.378   2.645 
##elapsed elapsed elapsed elapsed elapsed 
##  5.956  17.940  30.446  68.033 164.549 

timeslist<-Vectorize(rbindlisttime)(ii)
##elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed 
##  0.001   0.001   0.001   0.002   0.002   0.003   0.004   0.008   0.009   0.015 
##elapsed elapsed elapsed elapsed elapsed 
##  0.023   0.031   0.046   0.099   0.249 

Not only is rbindlist much faster, especially for long inputs, but its running time increases only linearly, whereas do.call(rbind) grows roughly quadratically. We can confirm this by fitting a log-log linear model to each set of times.

> lm(log(times) ~ log(ii))

Call:
lm(formula = log(times) ~ log(ii))

Coefficients:
(Intercept)      log(ii)  
      -9.73         1.73  

> lm(log(timeslist) ~ log(ii))

Call:
lm(formula = log(timeslist) ~ log(ii))

Coefficients:
(Intercept)      log(ii)  
   -10.0550       0.9455  

So, experimentally, the running time of do.call(rbind) grows with n^1.73 whereas rbindlist is about linear.

Upvotes: 4
