Gerome Pistre

Reputation: 467

Get indices of repeated instances of elements of a vector in other vector (both very large)

I have two vectors: one (A) of about 100 million non-unique integer elements, and the other (B) of 1 million unique elements drawn from the same set. I am trying to get a list containing the indices of the repeated instances of each element of B in A.

A <- c(2, 1, 1, 1, 2, 1, 1, 3, 3, 2)
B <- 1:3

# would result in this:
[[1]]
[1] 2 3 4 6 7

[[2]]
[1]  1  5 10

[[3]]
[1] 8 9

I first, naively, tried this:

b_indices <- lapply(B, function(b) which(A == b))

which is horribly inefficient and would apparently take years to complete.

The second thing I tried was to create a list of empty vectors, indexed by all elements of B, and then loop through A, appending the index to the corresponding vector for each element of A. Although technically O(n), I'm not sure about the cost of repeatedly appending elements. This approach would apparently take ~2-3 days, which is still too slow...
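Roughly, that second approach looks like this (a sketch; the names are just illustrative):

# one (initially empty) slot per element of B
b_indices <- vector("list", length(B))
names(b_indices) <- B

# single pass over A, appending each position to the matching slot;
# the repeated copying on each append is what makes this slow in R
for (i in seq_along(A)) {
  key <- as.character(A[i])
  b_indices[[key]] <- c(b_indices[[key]], i)
}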

Is there anything that could work faster?

Upvotes: 5

Views: 382

Answers (3)

Hack-R

Reputation: 23231

data.table is arguably the most efficient way of dealing with big data in R, and it would even let you avoid having to use that 1-million-element vector altogether!

require(data.table)
a <- data.table(x = rep(c("a", "b", "c"), each = 3))
# .I holds the row numbers; grouping by x collects them per unique value
a[ , list( yidx = list(.I) ) , by = x ]

   x  yidx
1: a 1,2,3
2: b 4,5,6
3: c 7,8,9

Using your example data:

a <- data.table(x=c(2, 1, 1, 1, 2, 1, 1, 3, 3, 2))
a[ , list( yidx = list(.I) ) , by = x ]

   x      yidx
1: 2   1, 5,10
2: 1 2,3,4,6,7
3: 3       8,9

Add this to your benchmarks. I dare say it should be significantly faster than using the built-in functions, if you test it at scale. The bigger the data, the better the relative performance of data.table, in my experience.

In my benchmark it only takes about 46% as long as order on my Debian laptop and only 5% as long as order on my Windows laptop with 8GB RAM and a 2.x GHz CPU.

B <- seq_len(1e6)
set.seed(42)
A <- data.table(x = sample(B, 1e8, TRUE))
system.time({
  res <- A[ , list( yidx = list(.I) ) , by = x ]
})

   user  system elapsed 
   4.25    0.22    4.50 
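If you do still want a plain list indexed like B, as in the question, you can reorder the grouped result by x and extract the list column. This is just a sketch on top of the result above (it assumes every value of B actually occurs in A):

setkey(res, x)          # sort the grouped rows by the value of x
b_indices <- res$yidx   # plain list: element i holds the positions of B[i] in A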

Upvotes: 6

akrun

Reputation: 887881

We can also use dplyr

library(dplyr)
data_frame(A) %>% 
      mutate(B = row_number()) %>%
      group_by(A) %>%
      summarise(B = list(B)) %>% 
      .$B

#[[1]]
#[1] 2 3 4 6 7

#[[2]]
#[1]  1  5 10

#[[3]]
#[1] 8 9

On a smaller dataset of size 1e5, system.time gives

#   user  system elapsed 
#   0.01    0.00    0.02 

but with the larger example shown in the other post, it is slower. However, this is dplyr...
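For an at-scale comparison, the same 1e8 example from the other answers could be timed like this (a sketch, not run here; timings will vary by machine):

library(dplyr)

B <- seq_len(1e6)
set.seed(42)
A <- sample(B, 1e8, TRUE)

system.time({
  res <- data_frame(A) %>%
    mutate(B = row_number()) %>%
    group_by(A) %>%
    summarise(B = list(B)) %>%
    .$B
})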

Upvotes: 2

Roland

Reputation: 132969

This is fast:

# order() with radix sort returns the permutation that sorts A;
# A[A1] then gives the group labels in sorted order for split()
A1 <- order(A, method = "radix")

split(A1, A[A1])
#$`1`
#[1] 2 3 4 6 7
#
#$`2`
#[1]  1  5 10
#
#$`3`
#[1] 8 9

B <- seq_len(1e6)
set.seed(42)
A <- sample(B, 1e8, TRUE)

system.time({
  A1 <- order(A, method = "radix")

  res <- split(A1, A[A1])
})
#   user  system elapsed 
#  8.650   1.056   9.704 
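One small detail, not from the original answer: split() returns a list named by the sorted unique values of A, so if you need the result aligned with and indexed like B (and unnamed, as in the question's expected output), you can reindex it, assuming every value of B occurs in A:

# reorder/unname so that element i corresponds to B[i]
res <- unname(res[as.character(B)])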

Upvotes: 8
