Reputation: 61
I'm trying to do some simple filtering on an existing data.table object that I don't want to modify. I noticed that it seems dramatically slower than the base R equivalent, as in the following example:
library(data.table)
dt <- data.table(
  label = sample(c("l1", "l2", "l3"), 300, replace = TRUE),
  x = rnorm(300)
)
df <- as.data.frame(dt)
all.equal(dt[label == "l3", x], df[df$label == "l3", "x"])
(bench <- microbenchmark::microbenchmark(
  dt_lookup = dt[label == "l3", x],
  df_lookup = df[df$label == "l3", "x"],
  mixed = dt[dt$label == "l3", "x"],
  times = 1000
))
which yields
Unit: microseconds
      expr    min     lq      mean median     uq    max neval
 dt_lookup 1159.2 1393.0 1529.4477 1451.6 1524.2 6487.9  1000
 df_lookup   17.3   25.2   33.8164   32.0   36.4  150.4  1000
     mixed  140.9  175.2  204.8512  193.9  220.7 1533.9  1000
That is, base R is more than 30 times faster here (about 45x by the median timings).
Am I doing something wrong here? Indices and keys do not seem to have much effect on the performance in this case.
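For completeness, here is roughly the kind of thing I tried with a key and a secondary index (a sketch on copies, since I don't want to modify the original dt); the timings did not change much:
# keyed lookup: setkey() reorders the table, so work on a copy
dtk <- copy(dt)
setkey(dtk, label)
dtk[.("l3"), x]
# secondary index: setindex() leaves the row order alone
dti <- copy(dt)
setindex(dti, label)
dti[label == "l3", x]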
Upvotes: 2
Views: 542
Reputation: 887108
Instead of subsetting the data.frame, extract the column first with [[ and then subset the rows:
df[["x"]][df[["label"]] == "l3"]
Upvotes: 4
Reputation: 3223
You can use a better data structure, such as a list. In your example you don't need to go through the entire vector each time if the lookup table is hierarchically structured:
tb = list(l1 = rnorm(300),
          l2 = rnorm(300),
          l3 = rnorm(300))
tb[["l3"]]  # accessed in ~1 microsecond
Upvotes: 0