Reputation: 12875
There are lots of questions on reading multiple files, and memory management. I'm looking for information that addresses both of these issues together.
I often have to read multiple parts of data as separate files, rbind them into one dataset, and then process it. Until now I've been using something like this -
rbindeddataset <- do.call("rbind", lapply(list.files(), read.csv, header = TRUE))
I'm concerned about the memory bump that one can observe with each of these approaches. That is probably the point at which both the rbindeddataset and the not-yet-rbinded datasets exist in memory together, but I don't know enough to be sure. Can someone confirm this?
Is there some way I can extend the principle of pre-allocation to such a task? Or some other trick that anyone knows that might help in avoiding that bump? I also tried rbindlist over the result of lapply and that doesn't show the bump. Does that mean rbindlist is smart enough to handle this?
data.table and Base R solutions preferred over some package's offerings.
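For reference, here is one way I imagine extending the pre-allocation principle to this task (just a sketch; it assumes every file has identical columns, and read_all_preallocated is my own illustrative name):

```r
# Sketch: pre-allocate the combined data frame once, then fill it in place,
# instead of growing it with repeated rbind calls.
# Assumes all CSVs share exactly the same columns.
read_all_preallocated <- function(filenames) {
  # Cheap first pass: count data rows (lines minus the header) per file
  rowcounts <- vapply(filenames,
                      function(f) length(count.fields(f, sep = ",")) - 1L,
                      integer(1))

  # Use the first file as a template for column names/types,
  # then pre-allocate the full frame as NA rows
  template <- read.csv(filenames[1], header = TRUE)
  combined <- template[rep.int(NA_integer_, sum(rowcounts)), , drop = FALSE]
  rownames(combined) <- NULL

  # Second pass: copy each file's rows into its slot
  offset <- 0L
  for (f in filenames) {
    chunk <- read.csv(f, header = TRUE)
    combined[(offset + 1L):(offset + nrow(chunk)), ] <- chunk
    offset <- offset + nrow(chunk)
  }
  combined
}
```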
EDIT ON 07-OCT-2013 based on the discussion with @Dwin and @mrip
> library(data.table)
> filenames <- list.files()
>
> #APPROACH 1 #################################
> starttime <- proc.time()
> test <- do.call("rbind", lapply(filenames, read.csv, header = TRUE))
> proc.time() - starttime
user system elapsed
44.60 1.11 45.98
>
> rm(test)
> rm(starttime)
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 350556 18.8 741108 39.6 715234 38.2
Vcells 1943837 14.9 153442940 1170.7 192055310 1465.3
>
> #APPROACH 2 #################################
> starttime <- proc.time()
> test <- lapply(filenames, read.csv, header = TRUE)
> test2 <- do.call("rbind", test)
> proc.time() - starttime
user system elapsed
47.09 1.26 50.70
>
> rm(test)
> rm(test2)
> rm(starttime)
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 350559 18.8 741108 39.6 715234 38.2
Vcells 1943849 14.9 157022756 1198.0 192055310 1465.3
>
>
> #APPROACH 3 #################################
> starttime <- proc.time()
> test <- lapply(filenames, read.csv, header = TRUE)
> test <- do.call("rbind", test)
> proc.time() - starttime
user system elapsed
48.61 1.93 51.16
> rm(test)
> rm(starttime)
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 350562 18.8 741108 39.6 715234 38.2
Vcells 1943861 14.9 152965559 1167.1 192055310 1465.3
>
>
> #APPROACH 4 #################################
> starttime <- proc.time()
> test <- do.call("rbind", lapply(filenames, fread))
> proc.time() - starttime
user system elapsed
12.87 0.09 12.95
> rm(test)
> rm(starttime)
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 351067 18.8 741108 39.6 715234 38.2
Vcells 1964791 15.0 122372447 933.7 192055310 1465.3
>
>
> #APPROACH 5 #################################
> starttime <- proc.time()
> test <- do.call("rbind", lapply(filenames, read.csv, header = TRUE))
> proc.time() - starttime
user system elapsed
51.12 1.62 54.16
> rm(test)
> rm(starttime)
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 350568 18.8 741108 39.6 715234 38.2
Vcells 1943885 14.9 160270439 1222.8 192055310 1465.3
>
>
> #APPROACH 6 #################################
> starttime <- proc.time()
> test <- rbindlist(lapply(filenames, fread ))
> proc.time() - starttime
user system elapsed
13.62 0.06 14.60
> rm(test)
> rm(starttime)
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 351078 18.8 741108 39.6 715234 38.2
Vcells 1956397 15.0 128216351 978.3 192055310 1465.3
>
>
> #APPROACH 7 #################################
> starttime <- proc.time()
> test <- rbindlist(lapply(filenames, read.csv, header = TRUE))
> proc.time() - starttime
user system elapsed
48.44 0.83 51.70
> rm(test)
> rm(starttime)
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 350620 18.8 741108 39.6 715234 38.2
Vcells 1944204 14.9 102573080 782.6 192055310 1465.3
As expected, the time savings are highest with fread. However, approaches 4, 6, and 7 show the least memory overhead and I'm not absolutely sure why.
Upvotes: 2
Views: 259
Reputation: 263342
Try this:
require(data.table)
system.time({
test3 <- do.call("rbind", lapply(filenames, fread, header = TRUE))
})
You mentioned pre-allocation. fread does have an 'nrows' argument, but it does not speed things up in the case where you know the number of rows in advance, because fread counts the number of rows itself up front automatically, which is very quick.
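To illustrate (a minimal sketch using a throwaway temp file; fwrite is used only to create the test data):

```r
library(data.table)

# Write a small CSV to a temp file purely for demonstration
tmp <- tempfile(fileext = ".csv")
fwrite(data.table(x = 1:5000, y = runif(5000)), tmp)

# Knowing the row count in advance and passing it via nrows does not
# speed fread up -- it already counts rows itself, very quickly.
full   <- fread(tmp)
capped <- fread(tmp, nrows = 5000)
all.equal(full, capped)   # the data is the same either way
```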
Upvotes: 2
Reputation: 15163
It looks like rbindlist preallocates the memory and constructs the new data frame in one pass, whereas do.call(rbind) will add one data frame at a time, copying it over each time. The result is that the rbind method has a running time of O(n^2), whereas rbindlist runs in linear time. Also, rbindlist should avoid the bump in memory, since it doesn't have to allocate a new data frame during each of the n iterations.
Some experimental data:
x <- data.frame(matrix(1:10000, 1000, 10))
ls <- list()
for (i in 1:10000)
  ls[[i]] <- x + i

rbindtime <- function(i){
  gc()
  system.time(do.call(rbind, ls[1:i]))[3]
}
rbindlisttime <- function(i){
  gc()
  system.time(data.frame(rbindlist(ls[1:i])))[3]
}
ii <- unique(floor(10 * 1.5^(1:15)))
## [1] 15 22 33 50 75 113 170 256 384 576 864 1297 1946 2919 4378
times <- Vectorize(rbindtime)(ii)
##elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed
##  0.009   0.014   0.026   0.049   0.111   0.209   0.350   0.638   1.378   2.645
##elapsed elapsed elapsed elapsed elapsed
##  5.956  17.940  30.446  68.033 164.549
timeslist <- Vectorize(rbindlisttime)(ii)
##elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed
##  0.001   0.001   0.001   0.002   0.002   0.003   0.004   0.008   0.009   0.015
##elapsed elapsed elapsed elapsed elapsed
##  0.023   0.031   0.046   0.099   0.249
Not only is rbindlist much faster, especially for long inputs, but its running time only increases linearly, whereas do.call(rbind) grows about quadratically. We can confirm this by fitting a log-log linear model to each set of times.
> lm(log(times) ~ log(ii))
Call:
lm(formula = log(times) ~ log(ii))
Coefficients:
(Intercept) log(ii)
-9.73 1.73
> lm(log(timeslist) ~ log(ii))
Call:
lm(formula = log(timeslist) ~ log(ii))
Coefficients:
(Intercept) log(ii)
-10.0550 0.9455
So, experimentally, the running time of do.call(rbind) grows with n^1.73, whereas rbindlist is about linear.
Upvotes: 4