user8173495

Reputation: 3

Optimizing matching in R

Hoping someone can help. I have a lot of ortholog mapping to do in R, and it is proving incredibly time-consuming. I've posted an example structure below. I have already tried the obvious approaches, such as iterating row by row (for i in 1:nrow(df)) with string splitting, or using sapply, and they are all incredibly slow. I am therefore hoping for a vectorized option.

options(stringsAsFactors = FALSE)

# example accession mapping
map <- data.frame(source = c("1", "2 4", "3", "4 6 8", "9"), 
                  target = c("a b", "c", "d e f", "g", "h i"))

# example protein list
df <- data.frame(sourceIDs = c("1 2", "3", "4", "5", "8 9"))

# now, map df$sourceIDs to map$target


# expected output
> matches
[1] "a b c" "d e f" "g"     ""      "g h i" 

I appreciate any help!

Upvotes: 0

Views: 110

Answers (1)

Nathan Werth

Reputation: 5263

In most cases, the best approach to this kind of problem is to create data.frames with one observation per row.

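# Split each column on spaces (the columns must be character, not factor)
map_split <- lapply(map, strsplit, split = ' ')
# Build every source/target combination for each row of map
long_mappings <- mapply(expand.grid, map_split$source, map_split$target, SIMPLIFY = FALSE)
# Stack the per-row tables into one long data.frame
all_map <- do.call(rbind, long_mappings)
names(all_map) <- c('source', 'target')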

Now all_map looks like this:

   source target
1       1      a
2       1      b
3       2      c
4       4      c
5       3      d
6       3      e
7       3      f
8       4      g
9       6      g
10      8      g
11      9      h
12      9      i
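
For a large mapping, calling expand.grid once per row may itself become slow. As a rough sketch that is not part of the original approach, the same long table can be built with rep() and lengths() alone. It assumes the IDs are separated by single spaces and that both columns are character vectors:

```r
# Toy mapping from the question.
map <- data.frame(source = c("1", "2 4", "3", "4 6 8", "9"),
                  target = c("a b", "c", "d e f", "g", "h i"),
                  stringsAsFactors = FALSE)

src <- strsplit(map$source, " ", fixed = TRUE)
tgt <- strsplit(map$target, " ", fixed = TRUE)
n_src <- lengths(src)  # number of source IDs in each row of map
n_tgt <- lengths(tgt)  # number of target IDs in each row of map

all_map <- data.frame(
  # repeat each source ID once per target in its row
  source = rep(unlist(src), times = rep(n_tgt, n_src)),
  # repeat each row's target vector once per source in that row
  target = unlist(rep(tgt, n_src)),
  stringsAsFactors = FALSE
)
```

On the toy data this yields the same 12 source/target pairs as shown above, with character columns instead of factors.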

Doing the same for df...

sourceIDs_split <- strsplit(df$sourceIDs, ' ')
df_long <- data.frame(
  index  = rep(seq_along(sourceIDs_split), lengths(sourceIDs_split)),
  source = unlist(sourceIDs_split)
)

This gives us df_long:

  index source
1     1      1
2     1      2
3     2      3
4     3      4
5     4      5
6     5      8
7     5      9

Now they just need to be merged and collapsed.

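# Left join keeps every row of df_long; unmatched sources get target = NA
matches <- merge(df_long, all_map, by = 'source', all.x = TRUE)
# Collapse targets per original row; sort() drops NA, so unmatched rows give ""
tapply(
  matches$target,
  matches$index,
  function(x) {
    paste0(sort(x), collapse = ' ')
  }
)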

#       1       2       3       4       5 
# "a b c" "d e f"   "c g"      "" "g h i" 
#
# Element 3 is "c g" because source 4 appears in both the "2 4" and "4 6 8" rows of map.
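
If the result needs to go back into df as a plain character vector, the merge-and-collapse step could be wrapped in a small helper. The following is only a sketch: collapse_targets is a hypothetical name, the long mapping is typed in by hand to match the all_map table above, and duplicate targets are dropped with unique(), which goes beyond what the code above does.

```r
# Long mapping, typed in by hand to match the all_map table (assumption:
# character columns rather than factors).
all_map <- data.frame(
  source = c("1", "1", "2", "4", "3", "3", "3", "4", "6", "8", "9", "9"),
  target = c("a", "b", "c", "c", "d", "e", "f", "g", "g", "g", "h", "i"),
  stringsAsFactors = FALSE
)
df <- data.frame(sourceIDs = c("1 2", "3", "4", "5", "8 9"),
                 stringsAsFactors = FALSE)

# collapse_targets() is a hypothetical helper, not from any package.
collapse_targets <- function(ids, long_map) {
  id_split <- strsplit(ids, " ", fixed = TRUE)
  long_ids <- data.frame(
    index  = rep(seq_along(id_split), lengths(id_split)),
    source = unlist(id_split),
    stringsAsFactors = FALSE
  )
  # Inner join: rows without any match simply do not appear in hits.
  hits <- merge(long_ids, long_map, by = "source")
  collapsed <- tapply(hits$target, hits$index,
                      function(x) paste(sort(unique(x)), collapse = " "))
  # Start from "" so unmatched rows stay empty, then fill in the matches.
  out <- character(length(ids))
  out[as.integer(names(collapsed))] <- as.vector(collapsed)
  out
}

df$matches <- collapse_targets(df$sourceIDs, all_map)
```

The output has one element per row of df, in the original row order, so it can be assigned as a column directly.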

Upvotes: 1
