R Greg Stacey

Reputation: 499

Tips for efficient string matching (and indexing) for large data in R?

What I want to do

I have a number of unique ids, e.g. id1, id2, etc. They appear in a number of groups, and each group is a random sample of between 1 and 100 ids, e.g.

[1] "id872- id103- id746-"                             
[2] "id830- id582-"                                   
[3] "id663- id403- id675- id584- id866- id399- id28- id550-"

I want to find all the groups to which an id belongs, e.g. for id1

groups.with.id[[1]] = grep("id1-", groups)
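For example, with just the three groups shown above (toy.groups below is simply those three strings), grep returns the indices of every group string containing the pattern:

toy.groups = c("id872- id103- id746-",
               "id830- id582-",
               "id663- id403- id675- id584- id866- id399- id28- id550-")
grep("id663-", toy.groups)
# [1] 3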

My problem

The data is large (80k ids, 1M groups) and this takes about 12 hours on my laptop. I think the slowness comes from a combination of two things: I'm doing something 80k times (once for every id), and every time I'm accessing a huge object (1M groups).

Is there any way to make this more efficient?

(I often have issues with larger objects in R. E.g. repeatedly indexing a data frame with >10k rows gets slow very quickly. General answers for efficient indexing of large objects are also appreciated!)

Example code

# make toy dataset
n.ids = 80000 # number of unique ids
n.groups = 500 # number of groups

# make data
ids = paste("id", 1:n.ids, "-", sep="")  # "id1-", "id2-", ...
groups = character(n.groups)
for (ii in 1:n.groups) {
  group.size = ceiling(runif(1) * 10)  # 1-10 ids per group (the real data has up to 100)
  groups[ii] = paste(sample(ids, group.size), collapse=" ")  # space-separated id string
}

# execute and time
t1 = Sys.time()
groups.with.id = sapply(ids, grep, groups)  # for each id, the indices of the groups containing it
difftime(Sys.time(), t1)

> Time difference of 29.63272 secs

In this example 500 groups take ~30s. I'd like to scale up to 1M groups. Is there an efficient way to do so?

Upvotes: 5

Views: 362

Answers (1)

lroha

Reputation: 34416

Here is a tidyverse alternative. One possible shortcoming is that IDs that don't exist in the data won't be included in the output, although this could be worked around if necessary.

library(tidyverse)

groups.with.id <- groups %>% 
  enframe() %>%                            # one row per group: name = group index, value = id string
  separate_rows(value, sep = " ") %>%      # one row per (group, id) pair
  group_by(value) %>%                      # group by id
  summarise(ids = list(name)) %>%          # collect the group indices for each id
  mutate(ids = set_names(ids, value)) %>%  # name each list element by its id
  pull(ids)                                # extract the named list
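The result is a named list in which each element holds the indices of the groups containing that id, so lookup becomes a direct subset rather than a search (id872- is just an illustrative key here, assuming it occurs in the data):

groups.with.id[["id872-"]]
# integer vector of the indices of `groups` that contain id872-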

If you also want to know which IDs are not present in any of the groups you can use:

setdiff(ids, names(groups.with.id))
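The same inverted-index idea can also be sketched in base R if you want to avoid the tidyverse dependency; a minimal sketch, not benchmarked here:

tokens = strsplit(groups, " ", fixed = TRUE)  # split each group string into its ids
# repeat each group's index once per id it contains, then bucket the indices by id
groups.with.id = split(rep(seq_along(groups), lengths(tokens)), unlist(tokens))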

Benchmarking indicates this approach is more than 100 times faster on the example dataset than using grep with fixed = TRUE.

microbenchmark::microbenchmark(original = sapply(ids, grep, groups),
                               original_fixed = sapply(ids, grep, groups, fixed = TRUE),
                               separate_summarise_groups = {groups %>% 
                                   enframe() %>%
                                   separate_rows(value, sep = " ") %>%
                                   group_by(value) %>%
                                   summarise(ids = list(name)) %>%
                                   mutate(ids = set_names(ids, value)) %>%
                                   pull(ids)}, times = 10, unit = "relative")

Unit: relative
                      expr      min       lq     mean   median        uq       max neval cld
                  original 685.0922 695.7236 382.0759 641.2018 290.30233 188.40790    10   c
            original_fixed 199.8912 209.1225 115.5693 199.9749  85.89842  59.26886    10  b 
 separate_summarise_groups   1.0000   1.0000   1.0000   1.0000   1.00000   1.00000    10 a 
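(With unit = "relative", all times are expressed as multiples of the fastest method, so for example the original approach's median run took roughly 641 times as long as the separate/summarise version.)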

Upvotes: 3
