Reputation: 654
Say I have a large table called 'data'. I want to delete the column referred to in a variable cI.
This is fast:
data = data[, eval(cI) := NULL]
This is slow:
data[, eval(cI) := NULL]
Both work (the 2nd usage doesn't print out (or return) the full table). What is going on under the hood to make the 2nd method slow? Obviously, a table copy is involved, but why?
The mystery deepens. I tried to measure system times, and there's a huge difference for the 2nd method depending on HOW I TIME IT:
> system.time(data <- data[, eval(dropI) := NULL])
   user  system elapsed
  0.004   0.000   0.003
> system.time(data[, eval(dropI) := NULL])
   user  system elapsed
  0.004   0.000   0.003
> date(); data[, eval(dropI) := NULL]; date()
[1] "Wed Jan 15 12:31:51 2014"
[1] "Wed Jan 15 12:31:58 2014"
> date(); data <- data[, eval(dropI) := NULL]; date()
[1] "Wed Jan 15 12:32:26 2014"
[1] "Wed Jan 15 12:32:26 2014"
Oh, and I have the JIT compiler enabled (setting of 3).
Upvotes: 3
Views: 163
Reputation: 7113
There is no evidence that there is a difference in runtime:
library( data.table )
library( microbenchmark )
library( ggplot2 )

set.seed(41)
dt <- data.table( a = rnorm(1000000), b = rnorm(1000000), c = rnorm(1000000) )

mb <- microbenchmark(
  m1 = { x <- copy( dt ); x[ , c := NULL ] },
  m2 = { x <- copy( dt ); x = x[ , c := NULL ] },
  times = 500
)
# plot
qplot( data = mb, x = expr, y = time, geom = "boxplot", ylab = "time [ns]", xlab = "approach" )
# show evidence
t.test( time ~ expr, data = mb )
Gives
        Welch Two Sample t-test

data:  time by expr
t = -0.3622, df = 972.022, p-value = 0.7173
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1437943.7   989846.5
sample estimates:
mean in group m1 mean in group m2
        10068827         10292876
and the boxplot (image omitted) shows the two timing distributions overlapping almost completely.
[ EDIT from Matt ] These times include the time to copy(dt), which appears to be done so that the column can be repeatedly deleted. See how copy(dt) appears inside the m1 and m2 definitions above. This is why the times vary so much and why even the best time is quite slow. In short, this benchmark appears to be flawed. If the copy(dt) is excluded from the benchmark, you should find that the time to delete a column is consistently almost unmeasurable (i.e. 0.00s) for both methods m1 and m2. This answer is correct that there's no difference between m1 and m2, but the graph should show a flat line at 0.00s once the timing of copy(dt) is isolated.
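A rough sketch of isolating that cost, reusing the dt from the benchmark above (exact timings will vary by machine):
x <- copy(dt)                  # make the copy outside the timed expression
system.time(x[, c := NULL])    # the deletion alone: effectively 0.000s
system.time(x2 <- copy(dt))    # the copy is what actually takes the time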
Upvotes: 3
Reputation: 59612
To clear this up, @eddi and @joran were on the right track in the comments.
There is absolutely no speed difference between:
data = data[, eval(cI) := NULL]
and
data[, eval(cI) := NULL]
Both execute instantly and consistently (0.000s). The other answer is just timing copy(dt); see my edit there.
Btw, you don't need the eval; the brackets alone are fine:
data = data[, (cI) := NULL]
or
data[, (cI) := NULL]
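For example, a minimal sketch with a small toy table (the column names here are made up):
library(data.table)
DT <- data.table(a = 1:3, b = 4:6, c = 7:9)
cI <- "c"                # column name held in a character variable
DT[, (cI) := NULL]       # parentheses force cI to be evaluated as a name
names(DT)                # "a" "b" -- column removed by reference, no copy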
What's happening is that you're typing these commands at the console. Since the first is an assignment, the value of data is returned invisibly and R doesn't print it. R does print the result of the second method.
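This invisibility rule is ordinary R behaviour, not something data.table-specific; a quick base-R sketch:
x <- 1:5        # assignment: result returned invisibly, nothing prints
x               # typing the name at the prompt triggers auto-printing
(x <- 1:5)      # wrapping the assignment in parentheses forces printing
invisible(x)    # invisible() suppresses auto-printing explicitly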
Just like with data.frame, there is a huge difference in speed between typing DT and typing print(DT):
> DT # very slow. R copies the whole of DT for some reason
> print(DT) # very fast. R doesn't copy DT.
In the question you concluded a copy was being taken and you were right. But by printing, not by column deletion.
Maybe because printing a DF prints the whole of it, which is so slow anyway that nobody notices R copied the DF as well before it started to convert the entire thing to character form. Since DT by default prints just the top and bottom of the table, which is very quick, you do notice how long R takes to copy it. Something like this, anyway.
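For instance (a sketch; the 1e6-row size is arbitrary):
library(data.table)
DT <- data.table(a = seq_len(1e6), b = rnorm(1e6))
print(DT)      # data.table prints only the first 5 and last 5 rows
DF <- as.data.frame(DT)
# print(DF)    # a data.frame would try to print all 1e6 rows -- very slow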
I don't know exactly why this is, but it's been known for a while. There are some copy reduction changes in the current development version of R and I'm hoping these will curtail the copy that's taken in what's known as automatic printing.
In the meantime, call print(DT) explicitly for memory-efficient printing!
Upvotes: 4