Reputation: 53
I have this for loop which returns, for every element of id1, the index of its first match in id2, but it is very slow (nrow(data) > 50 000)
example:
id1 <- c(1, 5, 8, 10)
id2 <- c(5, 8, 10, 1)
data <- data.frame(id1, id2, idx = 1:length(id1))
The result should be:
data$new_id
# [1] 4 1 2 3
data$new_id <- NA
for(i in 1:nrow(data)){
data$new_id[i] <- which(data$id2 == data$id1[i])
}
I also tried the approach below with outer(). It works for a small data frame, but unfortunately on the full data R returns "Error: cannot allocate vector of size 22.2 Gb", since outer() allocates an nrow(data) x nrow(data) matrix:
A <- outer(data$id1, data$id2, "==")
data <- data %>%
  mutate(new_id = which(t(A)),
         id0 = 0:(nrow(data) - 1),
         new_id = new_id - nrow(data) * id0)
Is there another solution for this indexing?
Upvotes: 2
Views: 126
Reputation: 887291
We can use match, which is very fast as it is a base R function. Here we are just matching the two columns directly, without trying to join the datasets:
with(data, match(id1, id2))
#[1] 4 1 2 3
To make this even faster, use fmatch from the fastmatch package:
library(fastmatch)
with(data, fmatch(id1, id2))
Benchmark on 10 million values:
set.seed(24)
data1 <- data.frame(id1 = sample(1e7), id2 = sample(1e7))
system.time(with(data1, match(id1, id2)))
# user system elapsed
# 1.635 0.079 1.691
system.time(with(data1, fmatch(id1, id2)))
# user system elapsed
# 1.155 0.062 1.195
library(data.table)
system.time({
  data2 <- data.table(id = data1$id1)
  data3 <- data.table(id = data1$id2)
  data2[data3, idx := .I, on = .(id)]
})
# user system elapsed
# 2.306 0.051 2.353
Upvotes: 3
Reputation: 27742
When using large datasets, you could try a data.table join (usually pretty fast). It should be even faster on large sets if you set keys first.
library( data.table )
#make data.tables out of your vectors
dt1 <- data.table( id = id1 )
dt2 <- data.table( id = id2 )

#update join: write the matching row numbers of dt2 into dt1, matching on id
dt1[dt2, idx := .I, on = .(id)]
# id idx
# 1: 1 4
# 2: 5 1
# 3: 8 2
# 4: 10 3
NB: this only returns the first matching position!
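A quick illustration of that caveat (my own made-up vector id2_dup, not from the question): when the lookup vector contains duplicates, match() likewise reports only the position of the first occurrence, so both approaches behave the same way.

```r
# Hypothetical example: the value 5 appears twice in id2_dup,
# but match() returns only the position of the first occurrence.
id2_dup <- c(5, 8, 5, 10)
match(5, id2_dup)
#> [1] 1
```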
Upvotes: 3