data.table column deletion speed

Question

Say I have a large table called 'data'. I want to delete the column referred to in a variable cI.

This is fast:

data = data[, eval(cI) := NULL]

This is slow:

data[, eval(cI) := NULL]

Both work, and the 2nd usage doesn't print out (or return) the full table. What is going on under the hood to make the 2nd method slow? Obviously, a table copy is involved, but why?

The mystery deepens. I tried to measure system times, and there's a huge difference for the 2nd method depending on HOW I TIME IT:

> system.time(data <- data[, eval(dropI) := NULL])
   user  system elapsed
  0.004   0.000   0.003
> system.time(data[, eval(dropI) := NULL])
   user  system elapsed
  0.004   0.000   0.003
> date(); data[, eval(dropI) := NULL]; date()
[1] "Wed Jan 15 12:31:51 2014"
[1] "Wed Jan 15 12:31:58 2014"
> date(); data <- data[, eval(dropI) := NULL]; date()
[1] "Wed Jan 15 12:32:26 2014"
[1] "Wed Jan 15 12:32:26 2014"

Oh, and I have the JIT compiler enabled (level 3).

Matt Dowle · Accepted Answer

To clear this up, @eddi and @joran were on the right track in the comments.

There is absolutely no speed difference between:

data = data[, eval(cI) := NULL]

and

data[, eval(cI) := NULL]

Both execute instantly and consistently (0.000s). The other answer is just timing copy(dt); see my edit there.

Btw, you don't need the eval; wrapping the variable in parentheses is enough:

data = data[, (cI) := NULL]

or

data[, (cI) := NULL]
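
As a minimal sketch of the parenthesised form, assuming a small made-up table (the column names and values here are hypothetical, not from the question), the deletion can be checked to happen by reference using data.table's address():

```r
library(data.table)

# Hypothetical toy table; the question's real 'data' is much larger.
data <- data.table(id = 1:5, x = rnorm(5), y = letters[1:5])
cI <- "x"                       # name of the column to drop, held in a variable

before <- address(data)         # memory address of the table before deletion
data[, (cI) := NULL]            # parentheses make := use the value of cI
after <- address(data)          # memory address after deletion

remaining <- names(data)        # column names that are left
same_object <- identical(before, after)  # whether the table was modified in place
```

Here remaining should hold the columns other than x, and same_object indicates that no new table was allocated for the deletion.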

What's happening is that you're typing these commands on the console. Since the first is an assignment, the value data is returned invisibly and R doesn't print it. R does print the result of the second method.
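
The visibility rule described above can be seen in base R with withVisible(), which reports whether the top level would auto-print a value. This is only an illustration on a plain vector (the names x and y are made up), not on the question's table:

```r
# withVisible() returns the value together with a flag saying
# whether R's top level would automatically print it.
x <- 1:3

vis_assign    <- withVisible(y <- x)$visible        # assignment: returned invisibly
vis_bare      <- withVisible(x)$visible             # bare expression: visible, so the console prints it
vis_invisible <- withVisible(invisible(x))$visible  # invisible(): suppresses auto-printing

c(assign = vis_assign, bare = vis_bare, invisible = vis_invisible)
```

Only the bare expression is flagged as visible, which is why the assignment form never triggers automatic printing at the console.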

Just as with data.frame, there is a huge difference in speed between typing DT and calling print(DT):

> DT          # very slow. R copies the whole of DT for some reason
> print(DT)   # very fast. R doesn't copy DT.

In the question you concluded that a copy was being taken, and you were right. But the copy is taken by printing, not by the column deletion.

Maybe because printing a data.frame prints the whole of DF, which is so slow anyway, nobody notices that R also copied DF before it started converting the entire DF to character form. Since a data.table by default prints only the top and bottom of the table, which is very quick, you notice how long R takes to copy it. Something like this, anyway.

I don't know exactly why this is, but it's been known for a while. There are some copy reduction changes in the current development version of R and I'm hoping these will curtail the copy that's taken in what's known as automatic printing.

In the meantime, call print(DT) explicitly for memory-efficient printing!
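
A rough sketch of how this looks in practice, assuming a made-up table DT (its size and column names are hypothetical). It also suggests why system.time() in the question showed both forms as instant: the value of an expression evaluated inside system.time() is discarded, so it is never auto-printed. Wrapping a call in invisible() is another base R way to keep the console from auto-printing a result.

```r
library(data.table)

# Hypothetical table, just large enough to make timings measurable.
DT <- data.table(a = seq_len(1e6), b = rnorm(1e6), c = rnorm(1e6))
cI <- "c"

# Inside system.time() the expression's value is thrown away,
# so only the deletion itself is timed and nothing is auto-printed.
t_delete <- system.time(DT[, (cI) := NULL])

# Explicit print: shows the top and bottom rows of the table.
t_print <- system.time(print(DT))

# invisible() marks the result as not to be auto-printed at the console.
invisible(DT[, b := NULL])

left <- names(DT)   # columns remaining after both deletions
```

No timings are quoted here since they depend on the machine and on the R and data.table versions in use.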
