R. Cowboy
R. Cowboy

Reputation: 223

How to use key to filter with range and why it's slower than between directly?

Using the data from Introduction to data.table and making some replications

library(data.table)
input <- if (file.exists("flights14.csv")) {
   "flights14.csv"
} else {
  "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv"
}
flights <- fread(input)
flights

dt <- rbindlist(replicate(n = 100, expr = flights, simplify = FALSE))
setkeyv(dt,'month')

Suppose that I want to filter the months 5, 6, 7 and 8. Then I can do it using the key

dt[.(c(5,6,7,8))]

or with between

dt[between(month,5,8)]

> all.equal(dt[between(month, 5, 8)], dt[.(c(5,6,7,8))])
[1] TRUE

I measure the system.timeusing both ways and I got

> system.time(dt[.(c(5,6,7,8))])
   user  system elapsed 
   0.91    0.20    0.78 
> system.time(dt[between(month, 5, 8), verbose = TRUE])
between parallel processing of integer took    0.036s
   user  system elapsed 
   0.70    0.17    0.51 

I have two questions

i) Is there a way to filter ranged values directly using the key variables or I need to use between passing the variable name?

ii) Why is the second approach using betweenfaster than the first using the key variable?

EDIT: About the times, when I pass multiple values as key like in the first approach, internally is it looking for one value at time or all the values simultaneously? I ask it because testing the first approach with a single value it's about 1/4 of the time.

EDIT2: Using on to subset

> system.time(dt[.(c(5,6,7,8)), on = 'month'])
   user  system elapsed 
   1.11    0.19    0.92 
> all.equal(dt[.(c(5,6,7,8)),on = 'month'], dt[between(month, 5,8)][order(month)])
[1] TRUE

times:

a. using on

> microbenchmark(dt[.(c(5,6,7,8)), on = 'month'])
Unit: milliseconds
                               expr      min       lq     mean   median       uq      max neval
 dt[.(c(5, 6, 7, 8)), on = "month"] 902.6143 928.9208 1107.647 1064.389 1218.027 1771.322   100

b. using between

> microbenchmark(dt[between(month, 5,8)])
Unit: milliseconds
                     expr      min      lq     mean   median       uq max neval
 dt[between(month, 5, 8)] 482.6147 499.077 601.2936 612.7822 670.9884 994   100

c. using month as key

> microbenchmark(dt[.(c(5,6,7,8))])
Unit: milliseconds
                 expr      min      lq     mean   median       uq      max neval
 dt[.(c(5, 6, 7, 8))] 744.0166 756.106 858.1838 773.7258 936.9275 1241.372   100

Upvotes: 2

Views: 87

Answers (1)

langtang
langtang

Reputation: 24742

To use key variable only, but use a range, just use seq

dt[.(seq(5,8))]
Unit: milliseconds
                        expr      min       lq     mean   median       uq      max neval
       range_with_seq_on_key 486.1632 535.6169 643.5530 628.3633 715.7411 1113.851   100
 range_with_between_on_month 387.0860 434.9599 557.6861 532.4342 630.5033 1039.224   100

Upvotes: 1

Related Questions