nicolas

Reputation: 9805

Why is transform.data.table so much slower than transform.data.frame?

I have a small data.table and using transform with it takes forever. Here is a reproducible example:

library(data.table)
#data.table 1.8.8
set.seed(1) 

dataraw <- data.table(sig1 = runif(80000, 0, 9999),
                      sig2 = runif(80000, 0, 9999),
                      sig3 = runif(80000, 0, 9999))

system.time(transform(dataraw, d = 1))
#  user      system     elapsed 
#16.345       0.016      16.359 

dataraw2 <- as.data.frame(dataraw)

system.time(transform(dataraw2, d = 1))
# user      system     elapsed 
#0.002       0.002       0.005 

Why is transform so much slower on a data.table than on a data.frame?

Upvotes: 1

Views: 523

Answers (2)

Arun

Reputation: 118799

Update: This was fixed long ago, in v1.8.10. From NEWS:

o The slowness of transform() on data.table has been fixed, #2599. But, please use :=.


Although it's clear from the documentation and from ?transform.data.table (see Señor O's post as well) that the idiomatic way is to use := (assign by reference), which is incredibly fast, it's still interesting to know why transform is slower on a data.table. From what I've been able to work out so far, transform.data.table is not always slower.

I'll attempt to answer that here. The problem doesn't seem to lie in transform.data.table per se, but in its call to the data.table() function. Looking at data.table:::transform.data.table, the lag comes from the line:

ans <- do.call("data.table", c(list(`_data`), e[!matched]))

So let's benchmark this line on a large data.table whose values are in order:

DT <- data.table(x=1:1e5, y=1:1e5, z=1:1e5)
system.time(do.call("data.table", c(list(DT), list(d=1))))
   user  system elapsed 
  0.003   0.003   0.026 

Oh this is extremely fast! Let's benchmark the same, but with values not in order:

DT <- data.table(x=sample(1e5), y=sample(1e5), z=sample(1e5))
system.time(do.call("data.table", c(list(DT), list(d=1))))

   user  system elapsed 
  7.986   0.016   8.099 

# tested on 1.8.8 and 1.8.9

It gets slow. What's causing this difference? To find out, we'll have to debug the data.table() function. By doing

DT <- data.table(x=as.numeric(1:1e5), y=as.numeric(1:1e5), z=as.numeric(1:1e5))
debugonce(data.table)
transform(DT, d=1)

and hitting "enter" repeatedly, you'll find that the slowness comes from the line:

exptxt = as.character(tt) # roughly about 7.2 seconds

It's clear that as.character is the bottleneck. To see why, compare:

as.character(data.frame(x=1:10, y=1:10))
# [1] "1:10" "1:10"

as.character(data.frame(x=sample(10), y=sample(10)))
# [1] "c(9, 10, 4, 7, 6, 5, 1, 3, 8, 2)" "c(8, 5, 3, 7, 6, 10, 9, 1, 4, 2)"

Repeat this on bigger data to see that as.character on a sampled data.frame gets slower: each unordered column is deparsed into one long c(...) string.
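As a rough illustration of the comparison above, the following base R sketch measures how long the deparsed string is for an ordered column versus a shuffled one. The size n and the variable names are my own choices for illustration, and I've left timings out since they depend on the R version.

```r
# Sketch: size of the string that as.character() builds per column.
# n and the variable names are illustrative assumptions.
set.seed(1)
n <- 1e4

ordered_df  <- data.frame(x = 1:n)        # a compact integer sequence
shuffled_df <- data.frame(x = sample(n))  # same values, random order

s_ordered  <- as.character(ordered_df)    # deparsed as a short range
s_shuffled <- as.character(shuffled_df)   # deparsed as one long c(...) call

# The shuffled column needs at least one character per element,
# while the ordered column collapses to a few characters.
nchar(s_ordered)
nchar(s_shuffled)
```

The string for the shuffled column grows with the number of rows, which is consistent with the slowdown seen on larger data.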

Now then, the question becomes, why isn't

data.table(x = sample(1e5), y=sample(1e5))

time consuming? This is because the input given to the data.table() function is substituted (with substitute()). In this case, tt becomes:

$x
sample(1e+05)

$y
sample(1e+05)

and as.character(tt) then just becomes:

# [1] "sample(1e+05)" "sample(1e+05)"
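To make the substitution step concrete, here is a minimal base R sketch of the same idea. capture_args is a hypothetical helper written for this illustration, not a data.table function; it only mimics capturing the arguments unevaluated and converting them to character.

```r
# Hypothetical helper (not part of data.table): capture the arguments
# as unevaluated expressions, then convert them to character.
capture_args <- function(...) {
  tt <- as.list(substitute(list(...)))[-1L]  # drop the leading `list` symbol
  as.character(tt)                           # deparses each short expression
}

res <- capture_args(x = sample(1e5), y = sample(1e5))
res
```

Because sample(1e5) is never evaluated inside the helper, as.character only has to deparse a short call rather than 1e5 numbers.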

This means, if you were to do:

DT <- data.table(x = c(1,3,4,1,4,1,3,1,2...), y = c(1,1,4,1,3,4,1,1,3...))

I'd expect this to take a LOT of time (but one doesn't usually write that, so it causes no issues in practice).

Upvotes: 13

Señor O

Reputation: 17412

From ?transform.data.table:

transform by group is particularly slow. Please use := by group instead.

within, transform and other similar functions in data.table are not just provided 
for users who expect them to work, but for non-data.table-aware packages to 
retain keys, for example. Hopefully the (much) faster and more convenient 
data.table syntax will be used in time. 

As @Roland suggests, you should always break down components of your code to find out what is actually taking up time/resources. In this case it is not log, but transform. Use := for data.tables, transform for data.frames, lists, etc.

The culprit is not log:

> dt <- data.table(A=1:1000000)
> system.time(transform(as.data.frame(dt), B=A * 1))
   user  system elapsed 
   0.00    0.02    0.01 
> system.time(transform(dt, B=A * 1))
   user  system elapsed 
  14.61    0.00   14.61 
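For comparison, the := form recommended above would look roughly like this on the same table. This is a minimal sketch assuming data.table is installed; I haven't included a timing since it varies by machine and version.

```r
library(data.table)

dt <- data.table(A = 1:1000000)

# Add column B by reference: dt is modified in place,
# so no reassignment (dt <- ...) is needed.
dt[, B := A * 1]

names(dt)
```

Unlike transform, this does not return a modified copy; dt itself gains the new column.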

Upvotes: 11
