Reputation: 53
I have this for loop which returns, for every element of id1, the index of its first match in id2, but it is very slow (nrow(data) > 50 000)
example:
id1 <- c(1, 5, 8, 10)
id2 <- c(5, 8, 10, 1)
data <- data.frame(id1, id2, idx = 1:length(id1))
The result should be:
data$new_id
# [1] 4 1 2 3
data$new_id <- NA
for(i in 1:nrow(data)){
data$new_id[i] <- which(data$id2 == data$id1[i])
}
I also tried the approach below with outer(). It works for a small data frame, but unfortunately on the full data R returns "Error: cannot allocate vector of size 22.2 Gb", since outer() allocates an nrow(data) x nrow(data) matrix:
A <- outer(data$id1, data$id2, "==")
data <- data %>%
  mutate(new_id = which(t(A)),
         id0 = 0:(nrow(data) - 1),
         new_id = new_id - nrow(data) * id0)
Is there another solution for this indexing?
Upvotes: 2
Views: 126
Reputation: 887291
We can use match, which is very fast as it is a base R function. Here we are just matching the two columns directly, without trying to join the datasets:
with(data, match(id1, id2))
#[1] 4 1 2 3
To make this even faster, use fmatch from the fastmatch package:
library(fastmatch)
with(data, fmatch(id1, id2))
Benchmark on 10 million values:
set.seed(24)
data1 <- data.frame(id1 = sample(1e7), id2 = sample(1e7))
system.time(with(data1, match(id1, id2)))
# user system elapsed
# 1.635 0.079 1.691
system.time(with(data1, fmatch(id1, id2)))
# user system elapsed
# 1.155 0.062 1.195
library(data.table)
system.time({
  data2 <- data.table(id = data1$id1)
  data3 <- data.table(id = data1$id2)
  data2[data3, idx := .I, on = .(id)]
})
# user system elapsed
# 2.306 0.051 2.353
Upvotes: 3
Reputation: 27742
When using large datasets, you could try a data.table join (usually pretty fast). It should be even faster on large sets if you set keys first.
library( data.table )
#make data.tables out of your vectors
dt1 <- data.table( id = id1 )
dt2 <- data.table( id = id2 )

#update join: write the matching row numbers of dt2 into dt1, matching on id
dt1[dt2, idx := .I, on = .(id)]
# id idx
# 1: 1 4
# 2: 5 1
# 3: 8 2
# 4: 10 3
NB: this only returns the first matching position!
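A quick illustration of that caveat (my own made-up vector id2_dup, not from the question): when the lookup vector contains duplicates, match() likewise reports only the position of the first occurrence, so both approaches behave the same way.

```r
# Hypothetical example: the value 5 appears twice in id2_dup,
# but match() returns only the position of the first occurrence.
id2_dup <- c(5, 8, 5, 10)
match(5, id2_dup)
#> [1] 1
```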
Upvotes: 3