Reputation: 223
Using the data from Introduction to data.table and making some replications
library(data.table)
input <- if (file.exists("flights14.csv")) {
"flights14.csv"
} else {
"https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv"
}
flights <- fread(input)
flights
dt <- rbindlist(replicate(n = 100, expr = flights, simplify = FALSE))
setkeyv(dt,'month')
Suppose that I want to filter the months 5, 6, 7 and 8. Then I can do it using the key
dt[.(c(5,6,7,8))]
or with between
dt[between(month,5,8)]
> all.equal(dt[between(month, 5, 8)], dt[.(c(5,6,7,8))])
[1] TRUE
I measure the system.time
using both ways and I got
> system.time(dt[.(c(5,6,7,8))])
user system elapsed
0.91 0.20 0.78
> system.time(dt[between(month, 5, 8), verbose = TRUE])
between parallel processing of integer took 0.036s
user system elapsed
0.70 0.17 0.51
I have two questions
i) Is there a way to filter ranged values directly using the key variables or I need to use between
passing the variable name?
ii) Why is the second approach using between
faster than the first using the key variable?
EDIT: About the times, when I pass multiple values as key
like in the first approach, internally is it looking for one value at time or all the values simultaneously? I ask it because testing the first approach with a single value it's about 1/4 of the time.
EDIT2: Using on
to subset
> system.time(dt[.(c(5,6,7,8)), on = 'month'])
user system elapsed
1.11 0.19 0.92
> all.equal(dt[.(c(5,6,7,8)),on = 'month'], dt[between(month, 5,8)][order(month)])
[1] TRUE
times:
a. using on
> microbenchmark(dt[.(c(5,6,7,8)), on = 'month'])
Unit: milliseconds
expr min lq mean median uq max neval
dt[.(c(5, 6, 7, 8)), on = "month"] 902.6143 928.9208 1107.647 1064.389 1218.027 1771.322 100
b. using between
> microbenchmark(dt[between(month, 5,8)])
Unit: milliseconds
expr min lq mean median uq max neval
dt[between(month, 5, 8)] 482.6147 499.077 601.2936 612.7822 670.9884 994 100
c. using month as key
> microbenchmark(dt[.(c(5,6,7,8))])
Unit: milliseconds
expr min lq mean median uq max neval
dt[.(c(5, 6, 7, 8))] 744.0166 756.106 858.1838 773.7258 936.9275 1241.372 100
Upvotes: 2
Views: 87
Reputation: 24742
To use key variable only, but use a range, just use seq
dt[.(seq(5,8))]
Unit: milliseconds
expr min lq mean median uq max neval
range_with_seq_on_key 486.1632 535.6169 643.5530 628.3633 715.7411 1113.851 100
range_with_between_on_month 387.0860 434.9599 557.6861 532.4342 630.5033 1039.224 100
Upvotes: 1