Atheriel

Reputation: 61

Why is data.table slower than base R for these lookup table-style queries?

I'm trying to do some simple filtering on an existing data.table object that I don't really want to modify. I noticed that it seems dramatically slower than the base R equivalent, as in the following example:

library(data.table)
dt <- data.table(
  label = sample(c("l1", "l2", "l3"), 300, replace = TRUE),
  x = rnorm(300)
)
df <- as.data.frame(dt)
all.equal(dt[label == "l3", x], df[df$label == "l3", "x"])

(bench <- microbenchmark::microbenchmark(
  dt_lookup = dt[label == "l3", x],
  df_lookup = df[df$label == "l3", "x"],
  mixed = dt[dt$label == "l3", "x"],
  times = 1000
))

which yields

Unit: microseconds
      expr    min     lq      mean median     uq    max neval
 dt_lookup 1159.2 1393.0 1529.4477 1451.6 1524.2 6487.9  1000
 df_lookup   17.3   25.2   33.8164   32.0   36.4  150.4  1000
     mixed  140.9  175.2  204.8512  193.9  220.7 1533.9  1000

That is, base R is more than 30 times faster.

Am I doing something wrong here? Indices and keys do not seem to have much effect on the performance in this case.
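For reference, the keyed and indexed variants would look something like the following (setkey(), setindex(), and the on= join are standard data.table syntax; timings on my machine were in the same ballpark):

# Keyed lookup: sort the table by label once, then binary-search it
setkey(dt, label)
dt[.("l3"), x]

# Secondary index: leaves the row order untouched
setindex(dt, label)
dt[label == "l3", x]

# Ad-hoc join without setting a key
dt["l3", on = "label", x]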

Upvotes: 2

Views: 542

Answers (2)

akrun

Reputation: 887108

Instead of subsetting the data.frame, extract the column first with [[ and then subset the rows:

df[["x"]][df[["label"]] == "l3"]
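Since a data.table is internally also a list of columns, the same pattern should work on dt directly, bypassing the overhead of [.data.table (a small variation, shown for illustration):

dt[["x"]][dt[["label"]] == "l3"]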

Upvotes: 4

JRR

Reputation: 3223

You can use a better-suited data structure such as a list. In your example you don't need to scan the entire vector each time if the lookup table is hierarchically structured:

tb = list(l1 = rnorm(300),
          l2 = rnorm(300),
          l3 = rnorm(300))

tb[["l3"]] # accessed in 1 micro sec
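If the data already live in dt as in the question, a structure like this can be built once with base R's split() (a minimal sketch, assuming the dt from the question):

tb <- split(dt$x, dt$label)   # named list: one numeric vector per label
tb[["l3"]]                    # later lookups are just named-element access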

Upvotes: 0
