Reputation: 499
What I want to do
I have a number of unique ids, e.g. id1, id2, etc. They appear in a number of groups, and each group is a random sample of between 1 and 100 ids, e.g.
[1] "id872- id103- id746-"
[2] "id830- id582-"
[3] "id663- id403- id675- id584- id866- id399- id28- id550-"
I want to find all the groups to which an id belongs, e.g. for id1
groups.with.id[1] = grep("id1-", groups)
My problem
The data is large (80k ids, 1M groups) and this takes about 12 hours on my laptop. I think it is a combination of the fact that I'm doing something 80k times (once for every id), and that every time I'm accessing a huge object (1M groups).
Is there any way to make this more efficient?
(I often have issues with larger objects in R. E.g. repeatedly indexing a data frame with >10k rows gets slow very quickly. General answers for efficient indexing of large objects are also appreciated!)
Example code
# make toy dataset
n.ids = 80000   # number of unique ids
n.groups = 500  # number of groupings
# make data
ids = paste("id", 1:n.ids, "-", sep="")
groups = character(n.groups)
for (ii in 1:n.groups) {
  group.size = ceiling(runif(1) * 10)  # 1 to 10 ids per group in this toy example
  groups[ii] = paste(sample(ids, group.size), collapse=" ")
}
# execute and time
t1 = Sys.time()
groups.with.id = sapply(ids, grep, groups)
difftime(Sys.time(), t1)
> Time difference of 29.63272 secs
In this example 500 groups take ~30s. I'd like to scale up to 1M groups. Is there an efficient way to do so?
Upvotes: 5
Views: 362
Reputation: 34416
Here is a tidyverse alternative. One possible shortcoming is that IDs that don't exist in the data won't be included in the output, although this could be worked around if necessary.
library(tidyverse)

groups.with.id <- groups %>%
  enframe() %>%                           # one row per group: name = group index, value = group string
  separate_rows(value, sep = " ") %>%     # one row per (group, id) pair
  group_by(value) %>%
  summarise(ids = list(name)) %>%         # collect the indices of the groups containing each id
  mutate(ids = set_names(ids, value)) %>% # name each list element by its id
  pull(ids)                               # a named list of integer vectors
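The result is a named list keyed by id, so individual ids can be looked up directly. A small usage illustration of the output above:
# inspect the first couple of entries: each element holds the indices of the groups containing that id
head(groups.with.id, 2)
# look up one id directly (returns NULL if that id happens to appear in no group)
groups.with.id[["id1-"]]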
If you also want to know which IDs are not present in any of the groups you can use:
setdiff(ids, names(groups.with.id))
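If those missing IDs should also appear in the output as empty entries, one possible sketch of the workaround mentioned above (not part of the original answer) is:
# append empty entries for ids that occur in no group
missing.ids <- setdiff(ids, names(groups.with.id))
groups.with.id <- c(groups.with.id,
                    set_names(vector("list", length(missing.ids)), missing.ids))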
Benchmarking indicates this approach is more than 100 times faster on the example dataset than using grep with fixed = TRUE.
microbenchmark::microbenchmark(
  original = sapply(ids, grep, groups),
  original_fixed = sapply(ids, grep, groups, fixed = TRUE),
  separate_summarise_groups = {groups %>%
      enframe() %>%
      separate_rows(value, sep = " ") %>%
      group_by(value) %>%
      summarise(ids = list(name)) %>%
      mutate(ids = set_names(ids, value)) %>%
      pull(ids)},
  times = 10, unit = "relative")
Unit: relative
                       expr      min       lq     mean   median        uq       max neval cld
                   original 685.0922 695.7236 382.0759 641.2018 290.30233 188.40790    10   c
             original_fixed 199.8912 209.1225 115.5693 199.9749  85.89842  59.26886    10  b
 separate_summarise_groups   1.0000   1.0000   1.0000   1.0000   1.00000   1.00000    10 a
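The speedup comes from splitting each group string only once and inverting the group-to-id mapping, rather than scanning every group once per id. As an illustration (not part of the benchmark above), the same inverted-index idea in base R might look like:
# split each group once, then regroup the (group index, id) pairs by id
tokens <- strsplit(groups, " ", fixed = TRUE)
pairs  <- data.frame(group = rep(seq_along(tokens), lengths(tokens)),
                     id    = unlist(tokens))
groups.with.id.base <- split(pairs$group, pairs$id)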
Upvotes: 3