Reputation: 715
The example data:
library(data.table)
DT <- data.table(a = c(1, 3, 5, 9, 15),
b = c("a", "c", "d", "e", "f"))
I would like to get the two rows where a == 3 | a == 9, that is:
# a b
# 3 c
# 9 e
I know that if I run DT[, a := as.character(a)], then setkey(DT, a), and then DT[c("3", "9")], I get the wanted result. But are there other ways to do this kind of binary search without converting a to character in advance?
Upvotes: 1
Views: 934
Reputation: 118799
First, you don't have to convert to a character column before performing a join / binary search based subset. You can use J() and pass an integer / numeric / character / logical / bit64::integer64 vector to it, like so:
DT[J(vec1, vec2, ...)]
where vec1 is matched against the first key column, vec2 against the second key column, and so on.
The fact that you don't have to add J() for character types is an additional feature, just for convenience. Because an integer/numeric/logical vector already has a meaning of its own (DT[1] returns the first row), we can't provide the same shortcut for those types. Hope this answers your original question.
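As a minimal sketch of the J() form with more than one key column (the table DT2 and its columns id, grp and val are made up for illustration and are not part of the question's data):

```r
library(data.table)

# Hypothetical two-key table, used only to illustrate J(vec1, vec2)
DT2 <- data.table(id  = c(1L, 1L, 2L, 2L),
                  grp = c("x", "y", "x", "y"),
                  val = 1:4)
setkey(DT2, id, grp)          # key on id first, then grp

# 2L is matched against the first key column (id), "y" against the second (grp)
res_join <- DT2[J(2L, "y")]

# A bare number in i is a row number, not a key lookup
res_row <- DT2[1]
```

Here res_join holds the single row with id 2 and grp "y", while res_row is simply the first row of the table.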
Coming back to your question: to subset column a for the values (3, 9), you can use data.table's binary search based subset:
require(data.table)
setkey(DT, a)
DT[J(c(3,9))] ## alternatively DT[.(c(3,9))] in 1.9.4+
# a b
# 1: 3 c
# 2: 9 e
There are two things to note when using data.table's binary search feature for fast subsets:
To address these issues and to provide better functionality, data.table 1.9.4 introduces a new experimental feature: automatic indexing, built on secondary keys. Matt has implemented this in 1.9.4.
What automatic indexing does is this: if a secondary key doesn't already exist, one is created the first time you run an expression that data.table (at the moment) knows how to optimise. It just computes the order of that column using data.table's fast radix ordering and stores it as an attribute. There's no reordering of the data at all, unlike setkey. You can also set the secondary key yourself using set2key().
The first time you run it, the time taken is a) the time to create the secondary key (usually very small) plus b) the time for the query. From the second time on, it's just the time for the query, and that's fast using binary search.
If you query another column with an expression that data.table understands now, then it'll additionally set a secondary key for that column as well, the first time it's run. And so on...
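A minimal sketch of what this looks like in practice, assuming a newer data.table release in which indices() is available for listing secondary keys (it is not part of the 1.9.4/1.9.5 code shown in this answer; the table D is made up for illustration):

```r
library(data.table)

options(datatable.auto.index = TRUE)   # the default, set here only to be explicit

# Hypothetical small table; x is deliberately left unkeyed
D <- data.table(x = c(5L, 3L, 9L, 3L), z = letters[1:4])

idx_before <- indices(D)   # secondary keys present before any query

res <- D[x == 3L]          # a form of query that automatic indexing can optimise

idx_after <- indices(D)    # inspect which secondary keys exist after the query
```

Comparing idx_before with idx_after shows whether the query created a secondary key on x. The rows of D stay in their original order either way.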
There should be no (noticeable) difference in speed between the two methods (once setkey and set2key are done). See example below.
The concept of secondary keys can be extended beyond automatic indexing, to joins as well. This will speed up data.table joins even further.
Here's an example. I'll use 1.9.5, as Matt's already fixed some bugs in automatic indexing.
require(data.table) ## 1.9.5+
set.seed(45L)
DT = data.table(x=sample(1e3, 5e7, TRUE))[, y := x]
setkey(DT, x)
set2key(DT, y)
Note that after setkey(.), DT will be reordered, by reference. But set2key just sets an attribute, and therefore your data isn't reordered based on y's order.
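The difference can be seen on a tiny table. This is only a sketch, and it assumes a newer data.table release where the secondary key is set with setindex(), which took over the role of set2key(); the table D is made up for illustration:

```r
library(data.table)

# Hypothetical table with two identical, unsorted columns
D <- data.table(x = c(3L, 1L, 2L), y = c(3L, 1L, 2L))

setindex(D, y)             # secondary key: only stores an ordering as an attribute
y_after_index <- D$y       # row order is unchanged

setkey(D, x)               # primary key: physically reorders the rows by x
x_after_key <- D$x         # rows are now sorted by x
```

After setindex() the rows are still in the order 3, 1, 2. After setkey() they are sorted as 1, 2, 3.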
The columns x and y are identical (on purpose). Let's test both:
system.time(DT[J(100L)]) ## on column x, 0.003 seconds
system.time(DT[y == 100L]) ## on column y, 0.003 seconds, uses secondary keys
identical(DT[J(100L)], DT[y==100L]) # [1] TRUE
How much time does it take with a vector scan?
options(datatable.auto.index = FALSE)
system.time(DT[y == 100L]) ## 0.428 seconds
Upvotes: 4
Reputation: 2677
You don't need to convert it into a character vector (although integer would make more sense):
DT <- data.table(a = c(1, 3, 5, 9, 15), b = c("a", "c", "d", "e", "f"))
setkey(DT, a)
DT[J(c(3, 9))]
Moreover, if you have the latest version from CRAN, the second time you use a in i it will automatically use binary search.
Upvotes: 1