Reputation: 705
I have a list of vectors of integers, for example:
set.seed(1)
vec_list <- replicate(100, sample(1:10000000, size=sample(1:10000, 1)), simplify=FALSE)
And a vector of integers, for example:
vec <- sample(1:10000000, size=10000)
How can I count the number of integers in each vector in vec_list that also appear in the vector vec? I can do this using a for loop. For example:
total_match <- rep(NA, length(vec_list))
for (i in seq_along(vec_list)) {
  # count the elements of the i-th vector that also occur in vec
  total_match[i] <- sum(vec_list[[i]] %in% vec)
  print(i)  # progress output
}
However, the list and vector I am trying to apply this to are very large, and the loop is slow. Please help with suggestions on how to improve performance.
Using data.table is much faster, but the result omits the vectors that have no matches instead of returning 0 for them. For example:
library(data.table)
DT <- data.table(repid=rep(seq_along(vec_list), lengths(vec_list)), val=unlist(vec_list))
total_match2 <- DT[.(vec), on=.(val), nomatch=0L, .N, keyby=.(repid)]$N
Upvotes: 2
Views: 504
Reputation: 66819
Another option, a variant of @chinsoon's answer:
library(data.table)
nvec = 5000
max_size = 10000
nv = 10000000
set.seed(1)
vec_list <- replicate(nvec, sample(nv, size=sample(max_size, 1)), simplify=FALSE)
vec <- sample(nv, size=max_size)
system.time(
res <- rbindlist(lapply(vec_list, list), idcol=TRUE)[.(vec), on=.(V1), nomatch=0L, .N, keyby=.id]
)
#    user  system elapsed
#    0.86    0.20    0.47
system.time({
DT <- setDT(stack(setNames(vec_list, 1:length(vec_list))))
res2 <- DT[, x := +(values %in% vec)][, sum(x), keyby=.(ind)]$V1
})
#    user  system elapsed
#    1.03    0.45    1.00
identical(res2[res2 != 0], res$N) # TRUE
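The comparison above holds only after dropping the zeros from res2, because the join-based res has no row for vectors without a match. A minimal sketch of one way to put the zeros back, assuming the ids in .id are the integer positions in vec_list (which is the case when the list is unnamed); the small inputs and the name total_match are illustrative only:

```r
library(data.table)

set.seed(1)
# Small hypothetical inputs, sized so that some vectors have no matches
vec_list <- replicate(20, sample(1000, size = sample(50, 1)), simplify = FALSE)
vec <- sample(1000, size = 30)

# Join-based counts: ids with no match are absent from the result
res <- rbindlist(lapply(vec_list, list), idcol = TRUE)[
  .(vec), on = .(V1), nomatch = 0L, .N, keyby = .id]

# Start from all zeros, then fill in the ids that did match
total_match <- integer(length(vec_list))
total_match[res$.id] <- res$N

# Check against the direct per-vector count
stopifnot(identical(total_match,
                    vapply(vec_list, function(v) sum(v %in% vec), integer(1))))
```

This keeps the fast join and only adds an index assignment, so the extra cost should be small relative to the join itself.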
Upvotes: 1
Reputation: 25225
Maybe try:
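library(data.table)
# flatten to a long table (values, ind), flag matches as 0/1, then sum the flags per vector
DT <- setDT(stack(setNames(vec_list, seq_along(vec_list))))
DT[, x := +(values %in% vec)][, sum(x), keyby=.(ind)]$V1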
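The same flatten-then-aggregate idea can also be sketched in base R, which may be handy when data.table is not available. This is an illustration rather than a benchmarked alternative; the small inputs and the names ids, hits and total_match are assumptions made for the example:

```r
set.seed(1)
# Small hypothetical inputs, for illustration only
vec_list <- replicate(20, sample(1000, size = sample(50, 1)), simplify = FALSE)
vec <- sample(1000, size = 30)

# Index of the source vector for every flattened element
ids <- rep(seq_along(vec_list), lengths(vec_list))
# TRUE where the flattened element occurs in vec
hits <- unlist(vec_list) %in% vec
# Count hits per source vector; bins with no hits stay at 0
total_match <- tabulate(ids[hits], nbins = length(vec_list))

# Check against the direct per-vector count
stopifnot(identical(total_match,
                    vapply(vec_list, function(v) sum(v %in% vec), integer(1))))
```

Because %in% is called once on the flattened vector, the lookup table for vec is built a single time rather than once per list element.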
Upvotes: 1