Rorschach
Rorschach

Reputation: 32426

Find row indices of matches in data.table (can I binary search?)

This seems like such an obvious question but I feel like I am doing it wrong. I have a vector of strings, and I just want to find the matching row indices in a data.table. The data.table is keyed by the column I want to match against, so, I think I should be able to use a binary search to find the matching indices.

Example to follow:

Here, I have a data.table, keyed by column c2 and a vector of strings, new_dat, for which I would like to find the row indices.

library(data.table)

## Example data, a keyed data.table
dat <- data.table(c1=1:10, c2=letters[1:10], key='c2')

## Match at some indices (keyed column so should be binary search?)
new_dat <- c('d', 'j')

## This doesn't feel right -- I don't think this is taking advantage of the
## data.table ordering at all
## Tried some dumb stuff like dat[match(new_dat, c2, 0L), .I]
dat[match(new_dat, c2, 0L), ]  # only want the index of the matches
#    c1 c2
# 1:  4  d
# 2: 10  j

## So, this is the desired result,
## but this is just doing ordinary linear search (I say w/o actually looking at the code)
match(new_dat, dat[['c2']], 0L)
# [1]  4 10

Edid

I just realized I could do,

dat[, ind := 1:.N][match(new_dat, c2, 0L), ind]

To get the indices, but still doesn't solve the problem I was trying to portray.

Upvotes: 4

Views: 590

Answers (1)

David Arenburg
David Arenburg

Reputation: 92292

In order to find row indices (while not grouping using the by argument) you could specify which = TRUE while performing the binary join

options(datatable.verbose = TRUE) # Setting to TRUE so we can see the binary join being triggered
dat[new_dat, which = TRUE]
# Starting bmerge ...done in 0 secs <~~ binary join triggered
# [1]  4 10

(This will also work without creating c1 because it doesn't use that column at all)

And, if you just want to perform a normal binary join and see all the values in all the columns, you don't need to use match or create an index, just do

dat[new_dat]
# Starting bmerge ...done in 0 secs <~~ binary join triggered
#    c1 c2
# 1:  4  d
# 2: 10  j

In general, as long as your data is keyed (or you using the on argument in order to join) and new_dat is not of class integer or numeric, data.table will automatically trigger the binary join even if new_dat is passed to the ith argument without being wrapped into . or J. Though, if new_dat is one of the above mentioned classes, data.table will try to perform a row indexing instead. Thus, you will need to use dat[.(new_dat), which = TRUE] or dat[J(new_dat), which = TRUE] in order to force bmerge instead row indexing.

Upvotes: 7

Related Questions