Reputation: 1011
Is there a specific method for combining a list of data.tables in R?
I have a list of ~20 data.tables, each with around 1 million rows, and would like to combine them into one data.table with 20 million rows.
I've been doing it with
Reduce('rbind', data.table)
but it takes a while.
Thanks!
Upvotes: 27
Views: 16480
Reputation: 42872
For my money, the plyr package's ldply is the best way to do this. It has the advantage that the name of each list element is added as a new first column, named .id.
In addition, a list of data frames is often the output of tapply, in which case you can replace the whole shebang with ddply.
Alternatives include do.call("rbind", mylist) or lattice's make.groups (I haven't been able to find this one recently, though).
Note: I may have misunderstood the question: I read data.frame instead of data.table. These techniques still work, but I'm not sure they result in a data.table all of the time.
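One way to check this on a small scale is sketched below. The named list `mylist` and its toy tables are hypothetical stand-ins, not the data from the question. The sketch checks the class of what do.call("rbind", ...) returns, and shows that data.table's rbindlist can add the list names as a first column through its idcol argument, similar to the .id column from ldply.

```r
library(data.table)

# Hypothetical named list of small data.tables (stand-ins for the real ones)
mylist <- list(
  a = data.table(x = 1:2, y = c("p", "q")),
  b = data.table(x = 3:4, y = c("r", "s"))
)

# do.call() hands every list element to a single rbind() call
combined <- do.call("rbind", mylist)
print(is.data.table(combined))  # inspect the class of the result

# rbindlist() with idcol stores each element's list name in a first column
with_id <- rbindlist(mylist, idcol = ".id")
print(with_id)
```

If the class check comes back FALSE for your own list, wrapping the result in as.data.table() or calling setDT() on it is one way to convert it.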
Upvotes: 2
Reputation: 69151
Using do.call appears to be about 10x faster with this made-up example:
library(data.table)
x1 <- data.table(x = runif(1e6), y = runif(1e6))
x2 <- data.table(x = runif(1e6), y = runif(1e6))
#20 data.tables all of length 1e6
yourList <- list(x1,x2,x1,x2,x1,x2,x1,x2,x1,x2,x1,x2,x1,x2,x1,x2,x1,x2,x1,x2)
system.time(out1 <- Reduce("rbind", yourList))
#-----
#   user  system elapsed
#   3.37    3.03    6.43
system.time(out2 <- do.call("rbind", yourList))
#-----
#   user  system elapsed
#   0.33    0.36    0.68
all.equal(out1,out2)
#-----
# [1] TRUE
I did not realize data.table had a specific function for this task. Par for the course, it is quite fast. Here is the relevant timing:
system.time(out3 <- rbindlist(yourList))
#-----
#   user  system elapsed
#   0.07    0.03    0.11
all.equal(out1,out3)
#-----
# [1] TRUE
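The timing example uses tables with identical columns. When the tables in a list differ in column order or have extra columns, rbindlist's use.names and fill arguments may help. Below is a minimal sketch with two made-up tables, `dt1` and `dt2`, which are illustrative assumptions and not part of the benchmark above.

```r
library(data.table)

# Two hypothetical tables: dt2 has its columns in a different order
# and carries an extra column z that dt1 lacks
dt1 <- data.table(x = 1:2, y = c(0.1, 0.2))
dt2 <- data.table(y = 0.3, x = 3L, z = "extra")

# use.names = TRUE matches columns by name rather than by position;
# fill = TRUE pads columns missing from a table with NA
out <- rbindlist(list(dt1, dt2), use.names = TRUE, fill = TRUE)
print(out)
```

Without fill = TRUE, rbindlist should stop with an error on tables whose column counts differ, which is usually the safer default for large inputs.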
Upvotes: 24
Reputation: 59602
See ?rbindlist and these related questions (easier to find when you know what to search for!):
data.table questions and answers containing rbindlist
Upvotes: 26