Reputation: 155

Why is a for loop on data.table row index slower than on data.frame?

I am confused about why accessing a data.table by row index is slower than accessing a data.frame. Any suggestions on how I can access each row of a data.table sequentially in a loop in a faster way?

> library(data.table)
> m = matrix(1L, nrow=100000, ncol=100)
> DF = as.data.frame(m)
> DT = as.data.table(m)

> identical(DF[100, ], DT[100, ])
[1] FALSE

> all(DF[100, ], DT[100, ])
[1] TRUE

> system.time(for (i in 1:1000) DT[i,])
   user  system elapsed 
  5.440   0.000   5.451 

> system.time(for (i in 1:1000) DF[i,])
   user  system elapsed 
  2.757   0.000   2.784

Upvotes: 6

Answers (1)

BrodieG

Reputation: 52637

A data.table query takes more arguments and does more work than a data.frame subset, so the fixed per-call overhead of DT[...] is larger than that of DF[...]. That overhead adds up when you call it in a loop. data.table is intended to execute a large, complex operation a few times, rather than a small, trivial one many times. So let's reformulate your test:

> system.time(DT[seq(len=nrow(m)),])
 user  system elapsed 
0.08    0.02    0.09 
> system.time(DF[seq(len=nrow(m)),])
 user  system elapsed 
0.08    0.05    0.13

Here, they are about the same. With a single DT call the overhead is paid only once, so it is barely noticeable. In your case you paid it 1,000 times (unnecessarily, I might add). If you are using data.table and calling it thousands of times, you are probably using it wrong. There is almost certainly a way to reformulate the problem so that one or a few data.table calls do the same thing.

Also, note that even my reformulated test here is pretty trivial, which is why data.table performs comparably to data.frame.
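
To make the "one call instead of many" idea concrete, here is a minimal sketch. The task (summing each row) is a hypothetical stand-in, since the question does not say what happens inside the loop, and the variable names are made up for illustration. It compares a per-row loop, a single data.table call, and, for cases where a loop really is unavoidable, a loop over a matrix built once up front.

```r
library(data.table)

# Smaller table than in the question, so the example runs quickly.
m  <- matrix(1L, nrow = 1000, ncol = 10)
DT <- as.data.table(m)
n  <- nrow(DT)

# 1) Per-row loop: `[.data.table` is called once per row,
#    so its fixed overhead is paid n times.
loop_sums <- integer(n)
for (i in seq_len(n)) {
  loop_sums[i] <- sum(unlist(DT[i, ]))
}

# 2) Single call: the same result from one data.table query.
#    With no `by`, .SD refers to all columns of DT.
single_call_sums <- DT[, rowSums(.SD)]

# 3) If a loop is truly needed: convert once, then index the matrix,
#    which is much cheaper per access than DT[i, ].
M <- as.matrix(DT)
mat_sums <- integer(n)
for (i in seq_len(n)) {
  mat_sums[i] <- sum(M[i, ])
}
```

Wrapping each variant in system.time() should show where the time goes on your machine. The matrix route in (3) assumes all columns share one type, as they do in the question's example; with mixed column types as.matrix() would coerce them to a common type.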

Upvotes: 7
