Reputation: 32426
This seems like such an obvious question but I feel like I am doing it wrong. I have a vector of strings, and I just want to find the matching row indices in a data.table. The data.table is keyed by the column I want to match against, so, I think I should be able to use a binary search to find the matching indices.
Example to follow:
Here, I have a data.table, keyed by column c2
and a vector of strings, new_dat
, for which I would like to find the row indices.
library(data.table)
## Example data, a keyed data.table
dat <- data.table(c1=1:10, c2=letters[1:10], key='c2')
## Match at some indices (keyed column so should be binary search?)
new_dat <- c('d', 'j')
## This doesn't feel right -- I don't think this is taking advantage of the
## data.table ordering at all
## Tried some dumb stuff like dat[match(new_dat, c2, 0L), .I]
dat[match(new_dat, c2, 0L), ] # only want the index of the matches
# c1 c2
# 1: 4 d
# 2: 10 j
## So, this is the desired result,
## but this is just doing ordinary linear search (I say w/o actually looking at the code)
match(new_dat, dat[['c2']], 0L)
# [1] 4 10
I just realized I could do,
dat[, ind := 1:.N][match(new_dat, c2, 0L), ind]
To get the indices, but still doesn't solve the problem I was trying to portray.
Upvotes: 4
Views: 590
Reputation: 92292
In order to find row indices (while not grouping using the by
argument) you could specify which = TRUE
while performing the binary join
options(datatable.verbose = TRUE) # Setting to TRUE so we can see the binary join being triggered
dat[new_dat, which = TRUE]
# Starting bmerge ...done in 0 secs <~~ binary join triggered
# [1] 4 10
(This will also work without creating c1
because it doesn't use that column at all)
And, if you just want to perform a normal binary join and see all the values in all the columns, you don't need to use match
or create an index, just do
dat[new_dat]
# Starting bmerge ...done in 0 secs <~~ binary join triggered
# c1 c2
# 1: 4 d
# 2: 10 j
In general, as long as your data is keyed (or you using the on
argument in order to join) and new_dat
is not of class integer
or numeric
, data.table
will automatically trigger the binary join even if new_dat
is passed to the i
th argument without being wrapped into .
or J
. Though, if new_dat
is one of the above mentioned classes, data.table
will try to perform a row indexing instead. Thus, you will need to use dat[.(new_dat), which = TRUE]
or dat[J(new_dat), which = TRUE]
in order to force bmerge
instead row indexing.
Upvotes: 7