Reputation: 295
I always assumed that data.table provided the best performance for data access.
However, I got the results below when I benchmarked the following two statements.
app_sig_reg[which(app_sig_reg$input == proj$country),]$value
app_sig_reg[input == proj$country,value]
where app_sig_reg is a data.table object.
These are the results I get when I use the microbenchmark library to measure their performance.
microbenchmark(
app_sig_reg[which(app_sig_reg$input == proj$country),]$value,
app_sig_reg[input == proj$country,value]
)
Unit: microseconds
                                                          expr   min     lq     mean  median      uq    max neval
 app_sig_reg[which(app_sig_reg$input == proj$country), ]$value 118.5 132.05  165.932  146.55  163.70  489.1   100
                     app_sig_reg[input == proj$country, value] 967.3 993.85 1098.607 1028.05 1123.35 1752.6   100
My assumption was that app_sig_reg[input == proj$country,value] would execute faster, but the results indicate the opposite.
I would appreciate any insight on this.
Upvotes: 0
Views: 264
Reputation: 76495
The question is not completely clear on what to match. If it is only one country, then the results below show that speed depends on two choices: which versus a plain equality test, and $ versus [ for objects of class "data.table".
If, instead of an equality test against one element (country), the tests are against many elements with %in%, the results may vary.
library(data.table)
library(microbenchmark)
library(ggplot2)

set.seed(2022)
# Small example table; it is doubled repeatedly inside testFun
app_sig_reg <- data.table(
  input = sample(letters, 100, TRUE),
  value = runif(100)
)
# A single country to match against
proj <- data.table(country = sample(letters, 1))

# For k = 1, ..., n: double X k times, benchmark the four
# subsetting idioms and keep the median time per expression
testFun <- function(X, n){
  out <- lapply(seq.int(n), \(k){
    Y <- X
    for(i in seq.int(k)) Y <- rbind(Y, Y)
    mb <- microbenchmark(
      `which$` = Y[which(Y$input == proj$country), ]$value,
      `which[` = Y[which(input == proj$country), value],
      `equal$` = Y[input == proj$country,]$value,
      `equal[` = Y[input == proj$country,value]
    )
    agg <- aggregate(time ~ expr, mb, median)
    agg$nrow <- nrow(Y)
    agg
  })
  do.call(rbind, out)
}

res <- testFun(app_sig_reg, 15)

# Median time against number of rows, both axes on a log10 scale
ggplot(res, aes(nrow, time, color = expr)) +
  geom_line() +
  geom_point() +
  scale_color_manual(values = c(`which$` = "red", `equal$` = "orangered", `which[` = "blue", `equal[` = "skyblue")) +
  scale_x_continuous(trans = "log10") +
  scale_y_continuous(trans = "log10") +
  theme_bw()
Created on 2022-02-20 by the reprex package (v2.0.1)
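For the %in% case mentioned above, a minimal sketch of the same four idioms could look like the following. The vector countries is a hypothetical stand-in for whatever set of values is to be matched (it is not from the question), and the table is the same kind of toy data as in the benchmark. The block first checks that the four forms return the same values, then times them. Timings will depend on the machine, the table size and the data.table version, so none are shown here.

```r
library(data.table)
library(microbenchmark)

set.seed(2022)
# Toy data with the same shape as in the benchmark above
app_sig_reg <- data.table(
  input = sample(letters, 100, TRUE),
  value = runif(100)
)
# Hypothetical: several countries to match instead of one
countries <- c("a", "e", "k")

# The four idioms, with == replaced by %in%
r1 <- app_sig_reg[which(app_sig_reg$input %in% countries), ]$value
r2 <- app_sig_reg[which(input %in% countries), value]
r3 <- app_sig_reg[input %in% countries, ]$value
r4 <- app_sig_reg[input %in% countries, value]

# All four are expected to return the same values in the same order
stopifnot(identical(r1, r2), identical(r2, r3), identical(r3, r4))

# Time the four forms; fewer repetitions than the default 100
mb <- microbenchmark(
  `which$` = app_sig_reg[which(app_sig_reg$input %in% countries), ]$value,
  `which[` = app_sig_reg[which(input %in% countries), value],
  `equal$` = app_sig_reg[input %in% countries, ]$value,
  `equal[` = app_sig_reg[input %in% countries, value],
  times = 20L
)
print(mb)
```

To see how this scales, the same expressions could be dropped into testFun in place of the == versions.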
Upvotes: 3