Powege

Reputation: 705

How to efficiently count matches across a list in R?

I have a list of vectors of integers, for example:

set.seed(1)

vec_list <- replicate(100, sample(1:10000000, size=sample(1:10000, 1)), simplify=FALSE)

And a vector of integers, for example:

vec <- sample(1:10000000, size=10000)

How can I count the number of integers in each vector in vec_list that also appear in vec? I can do this with a for loop. For example:

total_match <- rep(NA, length(vec_list))

for (i in 1:length(vec_list)){
  total_match[i] <- length(which(vec_list[[i]] %in% vec))
  print(i)
}

However, the list and vector I am trying to apply this to are very large, and the loop is slow. Please suggest ways to improve performance.

Using data.table is much faster, but the result omits vectors with no matches instead of returning 0 for them. For example:

library(data.table)
DT <- data.table(repid=rep(seq_along(vec_list), lengths(vec_list)), val=unlist(vec_list))
total_match2 <- DT[.(vec), on=.(val), nomatch=0L, .N, keyby=.(repid)]$N

Upvotes: 2

Views: 504

Answers (3)

Frank

Reputation: 66819

Another option, a variant of @chinsoon12's approach:

library(data.table)

nvec = 5000
max_size = 10000
nv = 10000000

set.seed(1)
vec_list <- replicate(nvec, sample(nv, size=sample(max_size, 1)), simplify=FALSE)
vec <- sample(nv, size=max_size)

system.time(
  res <- rbindlist(lapply(vec_list, list), idcol=TRUE)[.(vec), on=.(V1), nomatch=0L, .N, keyby=.id]
)
#    user  system elapsed 
#    0.86    0.20    0.47 


system.time({
  DT <- setDT(stack(setNames(vec_list, 1:length(vec_list))))
  res2 <- DT[, x := +(values %in% vec)][, sum(x), keyby=.(ind)]$V1
})
#    user  system elapsed 
#    1.03    0.45    1.00 


identical(res2[res2 != 0], res$N) # TRUE
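
If zeros are needed for the vectors with no matches, one option is to write the join counts into a zero-filled vector indexed by .id. A minimal sketch on toy inputs (the three small vectors below are made up for illustration, not the benchmark data), assuming vec contains no duplicates:

```r
library(data.table)

# Toy inputs, made up for illustration; the second vector has no matches
vec_list <- list(c(1L, 2L, 3L), c(10L, 11L), c(3L, 4L, 5L, 6L))
vec <- c(2L, 3L, 4L)

# Same join as above: only ids with at least one match are returned
res <- rbindlist(lapply(vec_list, list), idcol = TRUE)[
  .(vec), on = .(V1), nomatch = 0L, .N, keyby = .id]

# Start from all zeros, then fill in the counts for the ids that matched
total_match <- integer(length(vec_list))
total_match[res$.id] <- res$N
total_match
```

This keeps the fast join and only adds one vector assignment, so the result lines up with vec_list position by position.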

Upvotes: 1

chinsoon12

Reputation: 25225

Maybe try:

DT <- setDT(stack(setNames(vec_list, 1:length(vec_list))))
DT[, x := +(values %in% vec)][, sum(x), keyby=.(ind)]$V1
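
The same flatten-then-count idea can also be sketched in base R, in case data.table is not available. This is only an illustration on made-up toy inputs; tabulate() keeps a 0 for vectors with no matches:

```r
# Toy inputs, made up for illustration; the second vector has no matches
vec_list <- list(c(1L, 2L, 3L), c(10L, 11L), c(3L, 4L, 5L, 6L))
vec <- c(2L, 3L, 4L)

# One id per element, recording which list entry each value came from
ids <- rep(seq_along(vec_list), lengths(vec_list))

# A single vectorised membership test over all values at once
hits <- unlist(vec_list) %in% vec

# Count the hits per id; nbins guarantees one slot per list entry
total_match <- tabulate(ids[hits], nbins = length(vec_list))
total_match
```

Calling %in% once on the flattened values avoids rebuilding the lookup on vec for every list element.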

Upvotes: 1

tmfmnk

Reputation: 39858

What about:

sapply(vec_list, function(x) sum(x %in% vec))
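
A type-stable variant of the same one-liner uses vapply(), which always returns an integer vector (including 0 for vectors with no matches). A small sketch on made-up inputs:

```r
# Toy inputs, made up for illustration; the second vector has no matches
vec_list <- list(c(1L, 2L, 3L), c(10L, 11L), c(3L, 4L, 5L, 6L))
vec <- c(2L, 3L, 4L)

# sum() of a logical vector counts the TRUE values, i.e. the matches
total_match <- vapply(vec_list, function(x) sum(x %in% vec), integer(1))
total_match
```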

Upvotes: 2
