Performance of data.table

Question

I always assumed that data.table provided the best performance for data access.

However, I got the following results when I benchmarked these two statements.

app_sig_reg[which(app_sig_reg$input == proj$country),]$value
app_sig_reg[input == proj$country,value]

where app_sig_reg is a data.table object.

These are the results I get when I use the microbenchmark package to measure their performance.

microbenchmark(
  app_sig_reg[which(app_sig_reg$input == proj$country),]$value,
  app_sig_reg[input == proj$country,value]
)

Unit: microseconds
                                                          expr   min     lq     mean  median      uq    max neval
 app_sig_reg[which(app_sig_reg$input == proj$country), ]$value 118.5 132.05  165.932  146.55  163.70  489.1   100
                     app_sig_reg[input == proj$country, value] 967.3 993.85 1098.607 1028.05 1123.35 1752.6   100

My assumption was that app_sig_reg[input == proj$country,value] would execute faster, but the results indicate the opposite.

I would appreciate any insight on this.

Rui Barradas · Accepted Answer

The question does not make clear what is being matched. If it is a single country, the results below show that speed depends on two things:

which versus a plain equality test (==);
the extractor used on the object of class "data.table", $ versus [.

If, instead of an equality test against one element (country), the test is against several elements with %in%, the results may vary.

library(data.table)
library(microbenchmark)
library(ggplot2)

set.seed(2022)
app_sig_reg <- data.table(
  input = sample(letters, 100, TRUE),
  value = runif(100)
)
proj <- data.table(country = sample(letters, 1))


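# Benchmark the four which/equal and $/[ combinations on tables of growing size.
# Note: the benchmarked expressions read `proj` from the global environment.
testFun <- function(X, n){
  out <- lapply(seq.int(n), \(k){
    Y <- X
    # double the number of rows k times: nrow(Y) == nrow(X) * 2^k
    for(i in seq.int(k)) Y <- rbind(Y, Y)
    mb <- microbenchmark(
      `which$` = Y[which(Y$input == proj$country), ]$value,
      `which[` = Y[which(input == proj$country), value],
      `equal$` = Y[input == proj$country,]$value,
      `equal[` = Y[input == proj$country,value]
    )
    # keep the median time per expression, tagged with the table size
    agg <- aggregate(time ~ expr, mb, median)
    agg$nrow <- nrow(Y)
    agg
  })
  do.call(rbind, out)
}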

res <- testFun(app_sig_reg, 15)

ggplot(res, aes(nrow, time, color = expr)) +
  geom_line() +
  geom_point() +
  scale_color_manual(values = c(`which$` = "red", `equal$` = "orangered", `which[` = "blue", `equal[` = "skyblue")) +
  scale_x_continuous(trans = "log10") +
  scale_y_continuous(trans = "log10") +
  theme_bw()

Created on 2022-02-20 by the reprex package (v2.0.1)
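
For the many-countries case mentioned above, the same comparison can be set up by swapping == for %in%. The following is only a minimal sketch under assumptions: the table dt and the lookup values one and many are hypothetical stand-ins, not the asker's data. It checks that the variants return the same values, so any difference between them is purely one of speed; the timings themselves are not shown and would have to be measured with microbenchmark as above.

```r
library(data.table)

set.seed(2022)
# Hypothetical table with the same shape as the one used in this answer
dt <- data.table(
  input = sample(letters, 100, TRUE),
  value = runif(100)
)
one  <- "a"               # a single country (assumed value)
many <- c("a", "b", "c")  # several countries (assumed values)

# Single match: the four combinations of which/equal and $/[
r1 <- dt[which(dt$input == one), ]$value
r2 <- dt[which(input == one), value]
r3 <- dt[input == one, ]$value
r4 <- dt[input == one, value]
stopifnot(identical(r1, r2), identical(r1, r3), identical(r1, r4))

# Multiple matches: the same idea with %in% instead of ==
m1 <- dt[which(dt$input %in% many), ]$value
m2 <- dt[input %in% many, value]
stopifnot(identical(m1, m2))
```

To time the %in% variants, the expressions m1 and m2 can be passed to microbenchmark() in place of the equality versions inside testFun.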
