How to use key to filter with range and why it's slower than between directly?

Question

Using the data from Introduction to data.table and making some replications

library(data.table)
input <- if (file.exists("flights14.csv")) {
   "flights14.csv"
} else {
  "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv"
}
flights <- fread(input)
flights

dt <- rbindlist(replicate(n = 100, expr = flights, simplify = FALSE))
setkeyv(dt,'month')

Suppose that I want to filter the months 5, 6, 7 and 8. Then I can do it using the key

dt[.(c(5,6,7,8))]

or with between

dt[between(month,5,8)]

> all.equal(dt[between(month, 5, 8)], dt[.(c(5,6,7,8))])
[1] TRUE

I measure the system.timeusing both ways and I got

> system.time(dt[.(c(5,6,7,8))])
   user  system elapsed 
   0.91    0.20    0.78 
> system.time(dt[between(month, 5, 8), verbose = TRUE])
between parallel processing of integer took    0.036s
   user  system elapsed 
   0.70    0.17    0.51

I have two questions

i) Is there a way to filter ranged values directly using the key variables or I need to use between passing the variable name?

ii) Why is the second approach using betweenfaster than the first using the key variable?

EDIT: About the times, when I pass multiple values as key like in the first approach, internally is it looking for one value at time or all the values simultaneously? I ask it because testing the first approach with a single value it's about 1/4 of the time.

EDIT2: Using on to subset

> system.time(dt[.(c(5,6,7,8)), on = 'month'])
   user  system elapsed 
   1.11    0.19    0.92 
> all.equal(dt[.(c(5,6,7,8)),on = 'month'], dt[between(month, 5,8)][order(month)])
[1] TRUE

times:

a. using on

> microbenchmark(dt[.(c(5,6,7,8)), on = 'month'])
Unit: milliseconds
                               expr      min       lq     mean   median       uq      max neval
 dt[.(c(5, 6, 7, 8)), on = "month"] 902.6143 928.9208 1107.647 1064.389 1218.027 1771.322   100

b. using between

> microbenchmark(dt[between(month, 5,8)])
Unit: milliseconds
                     expr      min      lq     mean   median       uq max neval
 dt[between(month, 5, 8)] 482.6147 499.077 601.2936 612.7822 670.9884 994   100

c. using month as key

> microbenchmark(dt[.(c(5,6,7,8))])
Unit: milliseconds
                 expr      min      lq     mean   median       uq      max neval
 dt[.(c(5, 6, 7, 8))] 744.0166 756.106 858.1838 773.7258 936.9275 1241.372   100

How to use key to filter with range and why it's slower than between directly?

Answers (1)

Related Questions

How to use key to filter with range and why it&#39;s slower than between directly?

Answers (1)

Related Questions

How to use key to filter with range and why it's slower than between directly?