Why is transform.data.table so much slower than transform.data.frame?

Question

I have a small data.table, and using transform on it takes forever. Here is a reproducible example:

library(data.table)
#data.table 1.8.8
set.seed(1) 

dataraw <- data.table(sig1 = runif(80000, 0, 9999),
                      sig2 = runif(80000, 0, 9999),
                      sig3 = runif(80000, 0, 9999))

system.time(transform(dataraw, d = 1))
#  user      system     elapsed 
#16.345       0.016      16.359 

dataraw2 <- as.data.frame(dataraw)

system.time(transform(dataraw2, d = 1))
# user      system     elapsed 
#0.002       0.002       0.005

Why is transform so slow with a data.table in comparison to when used with a data.frame?

Arun · Accepted Answer

Update: This was fixed long ago, in v1.8.10. From NEWS:

o The slowness of transform() on data.table has been fixed, #2599. But, please use :=.
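For reference, here is a minimal sketch of the := form that the NEWS entry recommends, reusing the column names from the question. The object name DT is just an illustrative choice, and no timings are claimed here since they depend on the machine and the data.table version.

```r
library(data.table)
set.seed(1)

# Same shape of data as in the question (DT is an illustrative name)
DT <- data.table(sig1 = runif(80000, 0, 9999),
                 sig2 = runif(80000, 0, 9999),
                 sig3 = runif(80000, 0, 9999))

# := adds the column by reference, so DT is modified in place
# instead of being rebuilt through a call to data.table()
DT[, d := 1]

print(names(DT))
```

Unlike transform(), this changes DT itself, so take a copy() first if the original table must stay untouched.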

Although the documentation and ?transform.data.table (see SenorO's post as well) make it clear that the idiomatic way is to use := (assign by reference), which is incredibly fast, I think it's still interesting to know why transform is slower on a data.table. From what I've been able to work out so far, transform.data.table is not always slower.

I'll attempt to answer that here. The problem doesn't seem to be in transform.data.table per se, but rather in its call to the data.table() function. Looking at data.table:::transform.data.table, the lag comes from the line:

ans <- do.call("data.table", c(list(`_data`), e[!matched]))

So, let's benchmark this line on a big data.table whose values are in order:

DT <- data.table(x=1:1e5, y=1:1e5, z=1:1e5)
system.time(do.call("data.table", c(list(DT), list(d=1))))
#   user  system elapsed 
#  0.003   0.003   0.026

Oh, this is extremely fast! Let's benchmark the same call, but with values not in order:

DT <- data.table(x=sample(1e5), y=sample(1e5), z=sample(1e5))
system.time(do.call("data.table", c(list(DT), list(d=1))))

#   user  system elapsed 
#  7.986   0.016   8.099 

# tested on 1.8.8 and 1.8.9

It gets slow. What's causing this difference? To find out, we'll have to debug the data.table() function. By doing

DT <- data.table(x=as.numeric(1:1e5), y=as.numeric(1:1e5), z=as.numeric(1:1e5))
debugonce(data.table)
transform(DT, d=1)

and hitting "enter" repeatedly, you'll find that the slowness comes from the line:

exptxt = as.character(tt) # roughly about 7.2 seconds

It's clear that as.character is the issue. Why? To see, compare:

as.character(data.frame(x=1:10, y=1:10))
# [1] "1:10" "1:10"

as.character(data.frame(x=sample(10), y=sample(10)))
# [1] "c(9, 10, 4, 7, 6, 5, 1, 3, 8, 2)" "c(8, 5, 3, 7, 6, 10, 9, 1, 4, 2)"

Repeat this on bigger data to see that as.character on a sampled data.frame gets slower.
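A rough sketch of that comparison in base R follows. The size n and the object names are arbitrary choices for illustration, and the timings will vary with the machine and R version, so none are quoted here.

```r
# Base R only; n is kept modest so the sketch runs quickly
set.seed(1)
n <- 1e4

ordered_df  <- data.frame(x = 1:n, y = 1:n)
shuffled_df <- data.frame(x = sample(n), y = sample(n))

# as.character() on a list-like object turns each column into one string
t_ordered  <- system.time(s_ordered  <- as.character(ordered_df))
t_shuffled <- system.time(s_shuffled <- as.character(shuffled_df))

# Length of the string produced for each column
print(nchar(s_ordered))
print(nchar(s_shuffled))

# Elapsed time for each conversion
print(c(ordered  = t_ordered[["elapsed"]],
        shuffled = t_shuffled[["elapsed"]]))
```

The string for a shuffled column has to spell out every value, so it grows with the number of rows, while an ordered integer sequence collapses to a short from:to form.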

Now the question becomes: why isn't

data.table(x = sample(1e5), y=sample(1e5))

time-consuming? This is because the input given to the data.table() function is captured unevaluated (with substitute()). In this case, tt becomes:

$x
sample(1e+05)

$y
sample(1e+05)

and as.character(tt) then just becomes:

# [1] "sample(1e+05)" "sample(1e+05)"
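The effect can be imitated in base R with a small stand-in function. Note that show_args below is a hypothetical helper written for illustration, not data.table's actual code; it only mimics the substitute() plus as.character() step described above.

```r
# Hypothetical helper (not part of data.table): captures its arguments
# unevaluated, then converts the captured list to character
show_args <- function(...) {
  tt <- as.list(substitute(list(...)))[-1L]  # drop the leading `list` symbol
  as.character(tt)
}

# Called directly: the arguments are still short, unevaluated calls
direct <- show_args(x = sample(1e5), y = sample(1e5))
print(direct)

# Called through do.call() with a ready-made value: substitute() now
# sees the vector itself, so every element ends up in the string
via_do_call <- do.call(show_args, list(x = c(3L, 1L, 2L)))
print(via_do_call)
```

This is presumably why the do.call("data.table", ...) line in transform.data.table is the slow one: the columns arrive as values rather than as the expressions that created them.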

This means, if you were to do:

DT <- data.table(x = c(1,3,4,1,4,1,3,1,2...), y = c(1,1,4,1,3,4,1,1,3...))

I'd suppose that this would take a LOT of time (but one doesn't usually do this, hence no issues).
