Adrien

Reputation: 53

Is there a function in R that avoids a loop when looking up the matching index for every element of a vector?

I have this for loop, which returns the first matching index of every element in the vector, but it is very slow (nrow(data) > 50,000).

example:

id1 <- c(1,5,8,10)
id2 <- c(5,8,10,1)

data <- data.frame(id1, id2, idx = seq_along(id1))

The result should be:

data$new_id
# 4 1 2 3

The loop:

data$new_id <- NA
for (i in seq_len(nrow(data))) {
  # [1] keeps only the first position where id2 equals id1[i]
  data$new_id[i] <- which(data$id2 == data$id1[i])[1]
}

I found that the following works for a small data frame, but on the full data R returns "Error: cannot allocate vector of size 22.2 Gb":

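library(dplyr)

# n x n logical matrix: A[i, j] is TRUE when id1[i] == id2[j]
A <- outer(data$id1, data$id2, "==")

data <- data %>%
  mutate(new_id = which(t(A)),                # linear indices of the TRUE cells
         id0 = 0:(nrow(data) - 1),
         new_id = new_id - nrow(data) * id0)  # convert to positions in id2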

Is there another way to do this indexing?

Upvotes: 2

Views: 126

Answers (2)

akrun

Reputation: 887291

We can use match, a base R function that is very fast. Here we simply match one column of the dataset against the other; there is no need to join the two.

with(data, match(id1, id2))
#[1] 4 1 2 3
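
As a small illustration of how match behaves at the edges, here is a sketch with made-up data (the id 99 and the duplicated 5 are assumptions added for the example, not part of the question). It shows that match returns only the first position of a repeated value and NA when there is no match.

```r
id1 <- c(1, 5, 8, 10, 99)   # 99 is a made-up id with no partner in id2
id2 <- c(5, 8, 10, 1, 5)    # 5 appears twice on purpose

# For each element of id1, the position of its first occurrence in id2
pos <- match(id1, id2)
print(pos)

# nomatch replaces the default NA for unmatched ids
pos0 <- match(id1, id2, nomatch = 0L)
print(pos0)
```

Here the second 5 in id2 (position 5) is never reported, so this fits the "first matching index" behaviour of the loop in the question.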

To make this faster, use fmatch from the fastmatch package:

library(fastmatch)
with(data, fmatch(id1, id2))

Benchmarks

set.seed(24)
data1 <- data.frame(id1 = sample(1e7), id2 = sample(1e7))

system.time(with(data1, match(id1, id2)))
#  user  system elapsed 
# 1.635   0.079   1.691 

system.time(with(data1, fmatch(id1, id2)))
#  user  system elapsed 
# 1.155   0.062   1.195 


library(data.table)
system.time({
  data2 <- data.table(id = data1$id1)
  data3 <- data.table(id = data1$id2)
  data2[data3, idx := .I, on = .(id)]
})
#   user  system elapsed 
# 2.306   0.051   2.353 

Upvotes: 3

Wimpel

Reputation: 27742

With large datasets, you could try a data.table join, which is usually pretty fast. It should be even faster on large sets if you set keys first.

library(data.table)
# make data.tables out of your vectors
dt1 <- data.table(id = id1)
dt2 <- data.table(id = id2)
# update join: for each id in dt1, store the row number of the matching id in dt2
dt1[dt2, idx := .I, on = .(id)]

#    id idx
# 1:  1   4
# 2:  5   1
# 3:  8   2
# 4: 10   3

NB: this only returns the first matching position!
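
If every matching position is needed rather than only one, a possible base R sketch is to group the positions of id2 by value once and then look them up by id. The data below is made up for illustration, and the lookup by as.character() assumes the ids are integer-valued so that their character forms agree.

```r
id1 <- c(1, 5, 8)
id2 <- c(5, 1, 5, 8)   # 5 occurs at positions 1 and 3

# Group the positions of id2 by their value (a named list, one entry per id)
pos_by_value <- split(seq_along(id2), id2)

# Look up each id1 by name; each entry holds all positions for that id
all_matches <- unname(pos_by_value[as.character(id1)])
print(all_matches)
```

This builds the grouping in a single pass over id2 instead of scanning id2 once per element of id1, but it returns a list rather than a vector, so it does not drop straight into a data frame column the way match does.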

Upvotes: 3
