Reputation: 9805
I have a small data.table and using transform
with it takes forever. Here is a reproducible example:
library(data.table)
#data.table 1.8.8
set.seed(1)
dataraw <- data.table(sig1 = runif(80000, 0, 9999),
                      sig2 = runif(80000, 0, 9999),
                      sig3 = runif(80000, 0, 9999))
system.time(transform(dataraw, d = 1))
# user system elapsed
#16.345 0.016 16.359
dataraw2 <- as.data.frame(dataraw)
system.time(transform(dataraw2, d = 1))
# user system elapsed
#0.002 0.002 0.005
Why is transform so much slower with a data.table than with a data.frame?
Upvotes: 1
Views: 523
Reputation: 118799
The slowness of transform() on data.table has been fixed, #2599. But please use := instead.
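As a minimal sketch of that := idiom on the question's own example (same column names as the question):

```r
library(data.table)
set.seed(1)
DT <- data.table(sig1 = runif(80000, 0, 9999),
                 sig2 = runif(80000, 0, 9999),
                 sig3 = runif(80000, 0, 9999))
DT[, d := 1]   # adds column d by reference -- no copy of DT is made
# transform(DT, d = 1) would instead build and return a modified copy
```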
Although it's clear from the documentation and from ?transform.data.table (and from SenorO's post as well) that the idiomatic way is to use := (assign by reference), which is incredibly fast, I think it's still interesting to know why transform is slower on a data.table. From what I've managed to comprehend so far, transform.data.table is not always slower.
I'll make an attempt to answer that here. The problem doesn't seem to lie in transform.data.table per se, but rather in its call to the data.table() function. Looking at data.table:::transform.data.table, the lag comes from this line:
ans <- do.call("data.table", c(list(`_data`), e[!matched]))
So, let's benchmark this line with a big data.table whose values are in order:
DT <- data.table(x=1:1e5, y=1:1e5, z=1:1e5)
system.time(do.call("data.table", c(list(DT), list(d=1))))
user system elapsed
0.003 0.003 0.026
Oh this is extremely fast! Let's benchmark the same, but with values not in order:
DT <- data.table(x=sample(1e5), y=sample(1e5), z=sample(1e5))
system.time(do.call("data.table", c(list(DT), list(d=1))))
user system elapsed
7.986 0.016 8.099
# tested on 1.8.8 and 1.8.9
It gets slow. What's causing this difference? To find out, we'll have to debug the data.table() function. By doing
DT <- data.table(x=as.numeric(1:1e5), y=as.numeric(1:1e5), z=as.numeric(1:1e5))
debugonce(data.table)
transform(DT, d=1)
and by hitting "enter" successively, you'll find that the reason for the slowness is this line:
exptxt = as.character(tt) # roughly about 7.2 seconds
It's clear that as.character is the issue. Why? To see this, compare:
as.character(data.frame(x=1:10, y=1:10))
# [1] "1:10" "1:10"
as.character(data.frame(x=sample(10), y=sample(10)))
# [1] "c(9, 10, 4, 7, 6, 5, 1, 3, 8, 2)" "c(8, 5, 3, 7, 6, 10, 9, 1, 4, 2)"
Repeat this on bigger data to see that as.character on a sampled data.frame gets slower.
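Repeating that comparison at scale can be sketched as below (the timings shown are illustrative and will vary by machine; the key point is that the ordered column deparses to a tiny "1:100000" string while the sampled column deparses to a huge c(...) literal):

```r
# as.character() on a data.frame deparses each column to a string
n <- 1e5
ordered_df <- data.frame(x = 1:n)        # deparses compactly, e.g. "1:100000"
sampled_df <- data.frame(x = sample(n))  # deparses as one enormous c(...) literal
system.time(as.character(ordered_df))    # near-instant
system.time(as.character(sampled_df))    # noticeably slower
nchar(as.character(ordered_df))          # a handful of characters
nchar(as.character(sampled_df))          # hundreds of thousands of characters
```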
Now then, the question becomes: why isn't
data.table(x = sample(1e5), y = sample(1e5))
time consuming? This is because the input given to the data.table() function is substituted (with substitute()). In this case, tt becomes:
$x
sample(1e+05)
$y
sample(1e+05)
and as.character(tt)
then just becomes:
# [1] "sample(1e+05)" "sample(1e+05)"
This means that if you were to type the values out literally, as in:
DT <- data.table(x = c(1,3,4,1,4,1,3,1,2...), y = c(1,1,4,1,3,4,1,1,3...))
I'd suppose it would take a LOT of time (which one doesn't usually do, hence no issues).
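The substitution effect can be demonstrated outside data.table() with a small sketch (the helper f below is hypothetical, written only to mimic how data.table() captures its arguments unevaluated):

```r
# capture arguments unevaluated, the way data.table() does via substitute()
f <- function(...) {
  tt <- as.list(substitute(list(...)))[-1]  # list of unevaluated argument expressions
  as.character(tt)                          # deparse each expression -- cheap for short calls
}
f(x = sample(1e5), y = sample(1e5))
# [1] "sample(1e+05)" "sample(1e+05)"
```

The deparsed strings stay tiny no matter how large the sampled vectors are, which is why calling data.table() directly remains fast.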
Upvotes: 13
Reputation: 17412
From ?transform.data.table
:
transform by group is particularly slow. Please use := by group instead.
within, transform and other similar functions in data.table are not just provided
for users who expect them to work, but for non-data.table-aware packages to
retain keys, for example. Hopefully the (much) faster and more convenient
data.table syntax will be used in time.
As @Roland suggests, you should always break down components of your code to find out what is actually taking up time/resources. In this case it is not log, but transform. Use := for data.tables and transform for data.frames, lists, etc.
The culprit is not log
:
> dt <- data.table(A=1:1000000)
> system.time(transform(as.data.frame(dt), B=A * 1))
user system elapsed
0.00 0.02 0.01
> system.time(transform(dt, B=A * 1))
user system elapsed
14.61 0.00 14.61
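For comparison, a sketch of the := route on the same data (timings will vary by machine, but this path avoids the expensive deparsing inside transform entirely):

```r
library(data.table)
dt <- data.table(A = 1:1000000)
system.time(dt[, B := A * 1])   # adds B by reference; typically a few milliseconds
```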
Upvotes: 11