Reputation: 467
I have two vectors: A, with about 100 million non-unique elements (integers), and B, with the 1 million unique values that occur in A. I am trying to get a list containing, for each element of B, the indices of its repeated instances in A.
A <- c(2, 1, 1, 1, 2, 1, 1, 3, 3, 2)
B <- 1:3
# would result in this:
[[1]]
[1] 2 3 4 6 7
[[2]]
[1] 1 5 10
[[3]]
[1] 8 9
I first, naively, tried this:
b_indices <- lapply(B, function(b) which(A == b))
which is horribly inefficient and apparently wouldn't finish even after a few years.
The second thing I tried was to create a list of empty vectors, indexed by the elements of B, and then loop through A, appending each index to the corresponding vector (rough sketch below). Although the loop itself is O(n), each append copies the growing vector, and this approach would apparently take ~2-3 days, which is still too slow...
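Illustrative only, to show what I mean by the second approach; the repeated append is the expensive part:

b_indices <- vector("list", length(B))    # one empty vector per element of B
pos <- match(A, B)                        # slot in B for each element of A
for (i in seq_along(A)) {
  j <- pos[i]
  b_indices[[j]] <- c(b_indices[[j]], i)  # growing a vector copies it every time
}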
Is there anything that could work faster?
Upvotes: 5
Views: 382
Reputation: 23231
data.table is arguably the most efficient way of dealing with Big Data in R, and it would even let you avoid having to use that 1 million length vector altogether!
require(data.table)
a <- data.table(x = rep(c("a","b","c"), each = 3))
a[ , list( yidx = list(.I) ) , by = x ]   # .I holds the row numbers of each group
   x  yidx
1: a 1,2,3
2: b 4,5,6
3: c 7,8,9
Using your example data:
a <- data.table(x = c(2, 1, 1, 1, 2, 1, 1, 3, 3, 2))
a[ , list( yidx = list(.I) ) , by = x ]
   x      yidx
1: 2  1, 5,10
2: 1 2,3,4,6,7
3: 3       8,9
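If you do want a plain list ordered like B, as asked in the question, the grouped result can be rearranged afterwards. A minimal sketch, assuming the result above is stored in res:

res <- a[ , list( yidx = list(.I) ) , by = x ]
# reorder the list column to follow B and drop the data.table wrapper
b_indices <- res$yidx[match(B, res$x)]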
Add this to your benchmarks. I dare say it should be significantly faster than using the built-in functions, if you test it at scale. The bigger the data, the better the relative performance of data.table in my experience.

In my benchmark it only takes about 46% as long as order on my Debian laptop, and only 5% as long as order on my Windows laptop with 8GB RAM and a 2.x GHz CPU.
B <- seq_len(1e6)
set.seed(42)
A <- data.table(x = sample(B, 1e8, TRUE))
system.time({
  res <- A[ , list( yidx = list(.I) ) , by = x ]
})
#   user  system elapsed
#   4.25    0.22    4.50
Upvotes: 6
Reputation: 887881
We can also use dplyr
library(dplyr)
data_frame(A) %>%
mutate(B = row_number()) %>%
group_by(A) %>%
summarise(B = list(B)) %>%
.$B
#[[1]]
#[1] 2 3 4 6 7
#[[2]]
#[1] 1 5 10
#[[3]]
#[1] 8 9
On a smaller dataset of 1e5 elements, system.time gives
# user system elapsed
# 0.01 0.00 0.02
but with the larger example shown in the other post, it is slower. However, this is dplyr
...
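A side note: on current releases of dplyr/tibble, data_frame() is deprecated in favour of tibble(), so the same pipeline would look roughly like this (otherwise unchanged):

library(dplyr)
tibble(A) %>%
  mutate(B = row_number()) %>%
  group_by(A) %>%
  summarise(B = list(B)) %>%
  pull(B)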
Upvotes: 2
Reputation: 132969
This is fast:
A1 <- order(A, method = "radix")  # permutation that sorts A
split(A1, A[A1])                  # group the original indices by their value in A
#$`1`
#[1] 2 3 4 6 7
#
#$`2`
#[1] 1 5 10
#
#$`3`
#[1] 8 9
B <- seq_len(1e6)
set.seed(42)
A <- sample(B, 1e8, TRUE)
system.time({
A1 <- order(A, method = "radix")
res <- split(A1, A[A1])
})
# user system elapsed
#8.650 1.056 9.704
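The names of the resulting list are the values of A as character strings, so if you need the elements in the order of B rather than in sorted order, reindexing by name should work (a sketch, using res from above):

# names(res) are the unique values as strings; index by B to reorder
b_indices <- res[as.character(B)]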
Upvotes: 8