Reputation: 30311
The fastmatch package implements a much faster version of match
for repeated matches (e.g. in a loop):
set.seed(1)
library(fastmatch)
table <- 1L:100000L
x <- sample(table, 10000, replace=TRUE)
system.time(for(i in 1:100) a <- match(x, table))
system.time(for(i in 1:100) b <- fmatch(x, table))
identical(a, b)
Is there a similar implementation for %in%
I could use to speed up repeated lookups?
Upvotes: 27
Views: 4097
Reputation: 4024
Matching is almost always better done by putting both vectors in data frames and joining them (see the various joins in dplyr).
For example, something like this would give you all the info you need:
library(dplyr)
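# tibble() replaces the deprecated data_frame(); it only recycles length-1
# values, so the alternating columns are built with rep()
data = tibble(data.ID = 1L:100000L,
              data.extra = rep(1:2, length.out = 100000))
sample =
  data %>%
  sample_n(10000, replace=TRUE) %>%
  mutate(sample.ID = 1:n(),
         sample.extra = rep(3:4, length.out = n()))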
# join table not strictly necessary in this case
# but necessary in many-to-many matches
data__sample = inner_join(data, sample)
#check whether a data.ID made it into sample
data__sample %>% filter(data.ID == 1)
Alternatively, use left_join, right_join, full_join, semi_join, or anti_join, depending on what information is most useful to you.
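Of the joins listed above, semi_join and anti_join are the closest analogues of %in% and its negation, since they filter rows without adding columns. A minimal sketch, assuming dplyr is installed; the tibbles `lookup` and `query` are made up for illustration and are not part of the example above:

```r
library(dplyr)

lookup <- tibble(ID = 1L:10L)           # values to test against
query  <- tibble(ID = c(3L, 12L, 7L))   # values being looked up

# Rows of `query` whose ID appears in `lookup` (like ID %in% lookup$ID)
found <- semi_join(query, lookup, by = "ID")

# Rows of `query` whose ID does not appear in `lookup`
missing <- anti_join(query, lookup, by = "ID")

found$ID
missing$ID
```

Both calls keep only the columns of `query`, so this is a filter rather than a merge.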
Upvotes: 4
Reputation: 176668
Look at the definition of %in%:
R> `%in%`
function (x, table)
match(x, table, nomatch = 0L) > 0L
<bytecode: 0x1fab7a8>
<environment: namespace:base>
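To see why nomatch = 0L matters in that definition, here is a small base R illustration (the vectors are made up for the example):

```r
table <- c(10L, 20L, 30L)
x <- c(20L, 99L)

pos_default <- match(x, table)                # positions, NA where there is no match
pos_zero    <- match(x, table, nomatch = 0L)  # 0 instead of NA for misses
is_member   <- pos_zero > 0L                  # the logical vector %in% returns

pos_default  # 2 NA
pos_zero     # 2 0
is_member    # TRUE FALSE
```

Without nomatch = 0L the comparison would yield NA rather than FALSE for values that are absent.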
It's easy to write your own %fin% function:
`%fin%` <- function(x, table) {
  stopifnot(require(fastmatch))
  fmatch(x, table, nomatch = 0L) > 0L
}
system.time(for(i in 1:100) a <- x %in% table)
# user system elapsed
# 1.780 0.000 1.782
system.time(for(i in 1:100) b <- x %fin% table)
# user system elapsed
# 0.052 0.000 0.054
identical(a, b)
# [1] TRUE
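A variant of the operator above that does not attach the package on every call, as a minimal sketch. It assumes fastmatch may or may not be installed and falls back to base match() when it is not. The name %fin2% is hypothetical, chosen only to avoid clashing with anything else; recent fastmatch releases appear to export their own %fin%, so check the package documentation before defining one yourself.

```r
# Hypothetical operator name; not part of base R or of the answer above.
`%fin2%` <- function(x, table) {
  if (requireNamespace("fastmatch", quietly = TRUE)) {
    # fmatch() is documented to cache a hash of `table` on first use,
    # so repeated lookups against the same object can skip the rebuild.
    fastmatch::fmatch(x, table, nomatch = 0L) > 0L
  } else {
    # Fallback: same result as base %in%, without the caching.
    match(x, table, nomatch = 0L) > 0L
  }
}

set.seed(1)
table <- 1L:100000L
x <- sample(table, 10000, replace = TRUE)
identical(x %fin2% table, x %in% table)
```

Calling fastmatch::fmatch through the namespace keeps the search path untouched, which matters if the operator ends up inside a package or a long-running script.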
Upvotes: 36