rimorob

Reputation: 654

data.table column deletion speed

Say I have a large table called 'data'. I want to delete the column referred to in a variable cI.

This is fast:

data = data[, eval(cI) := NULL]

This is slow:

data[, eval(cI) := NULL]

Both work (the 2nd usage doesn't print out (or return) the full table). What is going on under the hood to make the 2nd method slow? Obviously, a table copy is involved, but why?

The mystery deepens. I tried to measure system times, and there's a huge difference for the 2nd method depending on HOW I TIME IT:

> system.time(data <- data[, eval(dropI) := NULL])
   user  system elapsed
  0.004   0.000   0.003
> system.time(data[, eval(dropI) := NULL])
   user  system elapsed
  0.004   0.000   0.003
> date(); data[, eval(dropI) := NULL]; date()
[1] "Wed Jan 15 12:31:51 2014"
[1] "Wed Jan 15 12:31:58 2014"
> date(); data <- data[, eval(dropI) := NULL]; date()
[1] "Wed Jan 15 12:32:26 2014"
[1] "Wed Jan 15 12:32:26 2014"

Oh, and I have the JIT compiler enabled (setting of 3).

Upvotes: 3

Views: 163

Answers (2)

Beasterfield

Reputation: 7113

There is no evidence that there is a difference in runtime:

library( data.table )

set.seed(41)
dt <- data.table( a = rnorm(1000000), b = rnorm(1000000), c = rnorm(1000000) )

library( microbenchmark )
library( ggplot2 )

mb <- microbenchmark(  
  m1 = { x <- copy( dt ); x[ , c:= NULL ] },
  m2 = { x <- copy( dt ); x = x[ , c:= NULL ] },
  times = 500
)

# plot
qplot( data = mb, x = expr, y = time, geom = "boxplot", ylab="time [ns]", xlab = "approach" )

# show evidence
t.test( time ~ expr, data = mb )

Gives

    Welch Two Sample t-test

data:  time by expr
t = -0.3622, df = 972.022, p-value = 0.7173
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1437943.7   989846.5
sample estimates:
mean in group m1 mean in group m2 
        10068827         10292876 

and

[ EDIT from Matt ] These times include the time to copy(dt), which appears to be done so that the column can be deleted repeatedly. See how copy(dt) appears inside the m1 and m2 definitions above. This is why the times vary so widely and why even the best time is quite slow. In short, this benchmark appears to be flawed. If copy(dt) is excluded from the benchmark, you should find that the time to delete a column is consistently almost unmeasurable (i.e. 0.00s) for both m1 and m2. This answer is correct that there's no difference between m1 and m2, but the graph should show a flat line at 0.00s once the timing of copy(dt) is isolated.

[ Figure: boxplot of time [ns] by approach for m1 and m2 ]
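A minimal sketch of how the two costs could be timed separately, assuming the same three-column table as in the benchmark above. The names t_copy and t_delete are illustrative only, and the actual numbers will vary by machine:

```r
library(data.table)

set.seed(41)
dt <- data.table(a = rnorm(1e6), b = rnorm(1e6), c = rnorm(1e6))

# Time the copy on its own (the step the edit above says was being measured).
t_copy <- system.time(x <- copy(dt))[["elapsed"]]

# Time only the column deletion, on the fresh copy made in the previous step.
t_delete <- system.time(x[, c := NULL])[["elapsed"]]

cat("copy:", t_copy, "s; delete:", t_delete, "s\n")
```

Because the deletion runs on x, the original dt keeps all three columns and the script can be rerun as is.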

Upvotes: 3

Matt Dowle

Reputation: 59612

To clear this up, @eddi and @joran were on the right track in the comments.

There is absolutely no speed difference between:

data = data[, eval(cI) := NULL]

and

data[, eval(cI) := NULL]

Both execute instantly and consistently (0.000s). The other answer is just timing copy(dt); see my edit there.

Btw, you don't need the eval; wrapping the variable in parentheses is enough:

data = data[, (cI) := NULL]

or

data[, (cI) := NULL]
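A small sketch of the parenthesised form on a toy table (the table DT and its column names are made up for illustration). It uses data.table's address() to check whether the table object itself stays at the same memory location after the deletion:

```r
library(data.table)

DT <- data.table(a = 1:3, b = 4:6, c = 7:9)
cI <- "b"                      # name of the column to drop, held in a variable

addr_before <- address(DT)     # memory address of the table before deletion
DT[, (cI) := NULL]             # parentheses make data.table use the value of cI
addr_after <- address(DT)

names(DT)                           # remaining column names
identical(addr_before, addr_after)  # whether DT is still the same object
```

No reassignment (data = ...) is needed for the column to disappear from DT.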

What's happening is that you're typing these commands on the console. Since the first is an assignment, the value data is returned invisibly and R doesn't print it. R does print the result of the second method.
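The visibility rule can be checked with base R's withVisible(), which reports whether a top-level expression would be auto-printed. A minimal sketch on a toy table (DT, DT2 and the vis_* names are illustrative):

```r
library(data.table)

DT <- data.table(a = 1:3, b = 4:6)

# withVisible() returns the value plus a flag saying whether R would auto-print it.
vis_assign <- withVisible(DT2 <- DT)$visible   # an assignment
vis_bare   <- withVisible(DT)$visible          # a bare object name

vis_assign  # FALSE: the result of an assignment is invisible
vis_bare    # TRUE: typing the object name triggers automatic printing
```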

As with data.frame, there is a huge difference in speed between typing DT and calling print(DT):

> DT          # very slow. R copies the whole of DT for some reason
> print(DT)   # very fast. R doesn't copy DT.

In the question you concluded that a copy was being taken, and you were right. But it is taken by printing, not by column deletion.

Maybe because auto-printing a data.frame prints the whole of DF, which is slow anyway, nobody notices that R also copied DF before it started converting the entire DF to character form. Since a data.table prints only the top and bottom of the table by default, which is very quick, you notice how long R takes to copy it. Something like this, anyway.

I don't know exactly why this is, but it's been known for a while. There are some copy reduction changes in the current development version of R and I'm hoping these will curtail the copy that's taken in what's known as automatic printing.

In the meantime, call print(DT) explicitly for memory-efficient printing!

Upvotes: 4
