Can someone please help me work out at what data frame size using data.table becomes faster for searches? In my use case the data frames will have 24,000 and 560,000 rows. Blocks of 40 rows are always singled out for further use.
Example: DF is a data frame with 120 rows, 7 columns (x1 to x7); "string" occupies the first 40 rows of x1.
DF2 is DF repeated 1,000 times => 120,000 rows
For a table the size of DF, data.table is slower; for the size of DF2, it is faster.
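A minimal sketch of how DF and DF2 could be constructed (the filler value in x1 and the numeric contents of x2 to x7 are not given above, so they are assumed here):

library(data.table)
library(microbenchmark)

# 120 rows, columns x1 to x7; "string" fills the first 40 rows of x1
# ("other" as the remaining x1 value and rnorm() for x2..x7 are assumptions)
DF <- data.frame(x1 = c(rep("string", 40), rep("other", 80)),
                 matrix(rnorm(120 * 6), nrow = 120,
                        dimnames = list(NULL, paste0("x", 2:7))),
                 stringsAsFactors = FALSE)

# DF2 is DF stacked 1,000 times => 120,000 rows
DF2 <- do.call(rbind, replicate(1000, DF, simplify = FALSE))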
Code:
> DT <- data.table(DF)
> setkey(DT, x1)
>
> DT2 <- data.table(DF2)
> setkey(DT2, x1)
>
> microbenchmark(DF[DF$x1=="string", ], unit="us")
Unit: microseconds
expr min lq median uq max neval
DF[DF$x1 == "string", ] 282.578 290.8895 297.0005 304.5785 2394.09 100
> microbenchmark(DT[.("string")], unit="us")
Unit: microseconds
expr min lq median uq max neval
DT[.("string")] 1473.512 1500.889 1536.09 1709.89 6727.113 100
>
>
> microbenchmark(DF2[DF2$x1=="string", ], unit="us")
Unit: microseconds
expr min lq median uq max neval
DF2[DF2$x1 == "string", ] 31090.4 34694.74 35537.58 36567.18 61230.41 100
> microbenchmark(DT2[.("string")], unit="us")
Unit: microseconds
expr min lq median uq max neval
DT2[.("string")] 1327.334 1350.801 1391.134 1457.378 8440.668 100
Upvotes: 13
Views: 1683
library(microbenchmark)
library(data.table)

timings <- sapply(1:10, function(n) {
  # 2^n distinct ids, 40 rows each (mirroring the 40-row blocks in the question)
  DF <- data.frame(id  = rep(as.character(seq_len(2^n)), each = 40),
                   val = rnorm(40 * 2^n),
                   stringsAsFactors = FALSE)
  DT <- data.table(DF, key = "id")
  # one id to look up
  tofind <- unique(DF$id)[n - 1]
  # median timing (ns) of five ways of extracting the rows for that id
  print(microbenchmark(DF[DF$id == tofind, ],
                       DT[DT$id == tofind, ],
                       DT[id == tofind],
                       `[.data.frame`(DT, DT$id == tofind, ),
                       DT[tofind]),
        unit = "ns")$median
})
matplot(1:10, log10(t(timings)), type = "l",
        xlab = "log2(n)", ylab = "log10(median (ns))", lty = 1)
legend("topleft",
       legend = c("DF[DF$id == tofind, ]",
                  "DT[DT$id == tofind, ]",
                  "DT[id == tofind]",
                  "`[.data.frame`(DT,DT$id==tofind,)",
                  "DT[tofind]"),
       col = 1:5, lty = 1)
data.table has made a few updates since this was written (a bit more overhead has been added to [.data.table as a few more arguments / robustness checks have been built in, but auto-indexing has also been introduced). Here's an updated version as of the January 13, 2016 version of 1.9.7 from GitHub:

[updated benchmark plot: data.table_1.9.7]

The main innovation is that the third option now leverages auto-indexing. The main conclusion remains the same -- if your table is of any nontrivial size (roughly larger than 500 observations), data.table's within-frame calling is faster.
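In case auto-indexing is unfamiliar: a plain == filter inside [.data.table builds a secondary index on that column the first time it runs, so later lookups on the same column use binary search instead of a full vector scan. A minimal sketch (the table contents here are made up, and indices() requires a reasonably recent data.table):

library(data.table)

# illustrative table: 1,000 ids, 40 rows each
DT <- data.table(id  = rep(as.character(seq_len(1000)), each = 40),
                 val = rnorm(40000))

indices(DT)      # NULL: no secondary index exists yet
DT[id == "500"]  # the first within-frame filter on id creates an index as a side effect
indices(DT)      # "id": later id == ... lookups reuse that index (binary search)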
(Notes about the updated plot: I changed some minor things (un-logging the y-axis, expressing the timings in microseconds, changing the x-axis labels, adding a title), but one non-trivial change is that I updated the microbenchmarks to add some stability in the estimates -- namely, I set the times argument to as.integer(1e5/2^n).)
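For reference, here is a sketch of how that times scaling slots into the benchmarking loop shown earlier (only the microbenchmark call changes; the expressions listed are just an illustrative subset, and the medians are pulled out in microseconds to match the updated plot):

# inside the sapply(1:10, function(n) { ... }) loop above:
# many repetitions for small tables, fewer for large ones, so each
# median is estimated from a comparable total amount of work
mb <- microbenchmark(DF[DF$id == tofind, ],
                     DT[id == tofind],
                     DT[tofind],
                     times = as.integer(1e5 / 2^n))
summary(mb, unit = "us")$median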
Upvotes: 21