posdef
posdef

Reputation: 6552

Creating an occurrence matrix for each char from a large number of equal length strings

I am trying to create a matrix which gives me the occurrence of each element at each position, based on a large number of strings in a vector.

I have the following pet example and potential solution:

# Reproducible example: 10 random 11-character strings drawn from A-Z.
set.seed(42)
# vapply() (not sapply()) pins the return type to character even for empty
# input; TRUE is spelled out because T can be reassigned.
seqs <- vapply(1:10, FUN = function(x) {
  paste(sample(LETTERS, size = 11, replace = TRUE), collapse = "")
}, FUN.VALUE = character(1))

# For each string, build an 11 x 26 logical indicator matrix: cell [i, j] is
# TRUE when position i holds LETTERS[j].  fixed = TRUE avoids treating the
# probe character as a regex (harmless for A-Z, but safer for other alphabets).
test <- lapply(seqs, FUN = function(s) {
  do.call(cbind, lapply(LETTERS, FUN = function(ch) {
    grepl(ch, unlist(strsplit(s, split = "")), fixed = TRUE)
  }))
})

# Element-wise sum of the per-string indicators gives the occurrence counts.
testR <- Reduce("+", test)
seqs
# [1] "XYHVQNTDRSL" "SYGMYZDMOXD" "ZYCNKXLVTVK" "RAVAFXPJLAZ" "LYXQZQIJKUB" "TREGNRZTOWE" "HVSGBDFMFSA" "JNAPEJQUOGC" "CHRAFYYTINT"
#[10] "QQFFKYZTTNA"
testR
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23]
 [1,]    0    0    1    0    0    0    0    1    0     1     0     1     0     0     0     0     1     1     1     1     0     0     0
 [2,]    1    0    0    0    0    0    0    1    0     0     0     0     0     1     0     0     1     1     0     0     0     1     0
 [3,]    1    0    1    0    1    1    1    1    0     0     0     0     0     0     0     0     0     1     1     0     0     1     0
 [4,]    2    0    0    0    0    1    2    0    0     0     0     0     1     1     0     1     1     0     0     0     0     1     0
 [5,]    0    1    0    0    1    2    0    0    0     0     2     0     0     1     0     0     1     0     0     0     0     0     0
 [6,]    0    0    0    1    0    0    0    0    0     1     0     0     0     1     0     0     1     1     0     0     0     0     0
 [7,]    0    0    0    1    0    1    0    0    1     0     0     1     0     0     0     1     1     0     0     1     0     0     0
 [8,]    0    0    0    1    0    0    0    0    0     2     0     0     2     0     0     0     0     0     0     3     1     1     0
 [9,]    0    0    0    0    0    1    0    0    1     0     1     1     0     0     3     0     0     1     0     2     0     0     0
[10,]    1    0    0    0    0    0    1    0    0     0     0     0     0     2     0     0     0     0     2     0     1     1     1
[11,]    2    1    1    1    1    0    0    0    0     0     1     1     0     0     0     0     0     0     0     1     0     0     0
      [,24] [,25] [,26]
 [1,]     1     0     1
 [2,]     0     4     0
 [3,]     1     0     0
 [4,]     0     0     0
 [5,]     0     1     1
 [6,]     2     2     1
 [7,]     0     1     2
 [8,]     0     0     0
 [9,]     0     0     0
[10,]     1     0     0
[11,]     0     0     1

I am trying to force myself to not use loops but instead use the vectorised functions but I am not sure if my solution is actually a good (efficient) solution or if I have gotten confused somewhere. It is also fairly difficult to debug if real life data messes up somehow (which sadly is the case).

So my question, what is a good way to solve this problem?

EDIT: Following 989's track, I have done a benchmark test of the proposed solutions here, with a data size more representative of the problem at hand.

library(microbenchmark)
# Benchmark data closer to the real problem size: 10,000 strings of 31 chars.
set.seed(42)
# vapply() guarantees a character vector; TRUE written out instead of T.
seqs <- vapply(1:10000, FUN = function(x) {
  paste(sample(LETTERS, size = 31, replace = TRUE), collapse = "")
}, FUN.VALUE = character(1))

# OP's baseline: per string, build a position x letter logical indicator
# matrix via grepl(), then sum the matrices element-wise with Reduce("+").
# Reads `seqs` from the enclosing environment; returns a position x 26 matrix.
f.posdef <- function() {
  indicator <- function(s) {
    chars <- unlist(strsplit(s, split = ""))
    cols <- lapply(LETTERS, function(ch) grepl(ch, chars))
    do.call(cbind, cols)
  }
  per_string <- lapply(seqs, FUN = indicator)
  Reduce("+", per_string)
}

# 989's approach: one-hot rows placed via two-column matrix indexing into a
# zero matrix, with match() mapping each character to its column.
# Reads `seqs` from the enclosing environment; returns a position x 26 matrix.
f.989 <- function() {
  l <- lapply(seqs, function(x) {
    # length(LETTERS) instead of the magic number 26, so the alphabet could
    # be swapped without silently mis-sizing the matrix; seq_len() is the
    # safe form of 1..n (seq(n) misbehaves when n == 0).
    m <- matrix(0, nchar(x), length(LETTERS))
    replace(m, cbind(seq_len(nchar(x)), match(strsplit(x, "")[[1]], LETTERS)), 1)
  })
  Reduce("+", l)
}

# docendo's first solution: one table() per string (factor() fixes the full
# A-Z column set), summed across strings, then transposed to position x letter.
# The position index is derived from the data instead of the hard-coded 1:31,
# so any common string length works; it is also hoisted out of the lapply().
f.docendo1 <- function() {
  positions <- seq_len(nchar(seqs[1]))
  t(Reduce("+", lapply(strsplit(seqs, "", fixed = TRUE), function(xx)
    table(factor(xx, levels = LETTERS), positions))))
}

# docendo's second solution: a single table() call over all characters at
# once, with the position index recycled across strings.  String length and
# sample size are computed from `seqs` rather than hard-coded (31, 10000).
# NOTE: columns cover only the letters actually observed in the data.
f.docendo2 <- function() {
  t(table(do.call(cbind, strsplit(seqs, "", fixed = TRUE)),
          rep(seq_len(nchar(seqs[1])), length(seqs))))
}

# akrun's tidyverse pipeline (needs purrr/dplyr loaded): transpose the split
# strings so each list element collects every character seen at one position,
# stack into a long data frame, and cross-tabulate position x letter with a
# single table() call.
# NOTE(review): the equality check below reports FALSE for this function --
# presumably a dimname/ordering mismatch rather than wrong counts; confirm
# against the reference result before trusting it.
f.akrun=function(){
  strsplit(seqs, "") %>% 
    transpose %>% 
    map(unlist) %>% 
    setNames(seq_len(nchar(seqs[1]))) %>% 
    stack %>%
    select(2:1) %>% 
    table
}

# Reference result used to validate the other implementations.
r <- f.posdef()

Note that the main difference between this benchmark and 989's is the sample size.

> all(r==f.989())
[1] TRUE
> all(r==f.docendo1())
[1] TRUE
> all(r==f.docendo2())
[1] TRUE
> all(r==f.akrun())
[1] FALSE
> res <- microbenchmark(f.posdef(), f.989(), f.docendo1(), f.docendo2(), f.akrun())
> autoplot(res)

benchmark plot

As the plot shows, akrun's solution is blazing fast but seemingly inaccurate. Thus the gold medal goes to docendo's second solution. However, it is probably worth noting that both solutions by docendo, as well as the suggestion by 989, have assumptions regarding the length/number of the sample strings or the alphabet size, as in m <- matrix(0, nchar(x), 26)

In the case of the size/length of the sample strings (i.e. seqs), removing the assumption would just mean an additional call to nchar, which should not impact the runtime much. I am not sure how one would avoid making an assumption about the alphabet size if it is not known a priori.

Upvotes: 4

Views: 100

Answers (3)

talat
talat

Reputation: 70296

Here's another approach in base R which requires less looping than the approach by OP:

# One table() per string -- factor(levels = LETTERS) fixes the full A-Z
# column set so absent letters still get zero columns; the per-string tables
# are summed with Reduce("+") and t() flips the result to position x letter.
t(Reduce("+", lapply(strsplit(seqs, "", fixed = TRUE), function(xx) 
                               table(factor(xx, levels = LETTERS), 1:11))))

#    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
# 1  0 0 1 0 0 0 0 1 0 1 0 1 0 0 0 0 1 1 1 1 0 0 0 1 0 1
# 2  1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 4 0
# 3  1 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0
# 4  2 0 0 0 0 1 2 0 0 0 0 0 1 1 0 1 1 0 0 0 0 1 0 0 0 0
# 5  0 1 0 0 1 2 0 0 0 0 2 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1
# 6  0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 2 2 1
# 7  0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 1 2
# 8  0 0 0 1 0 0 0 0 0 2 0 0 2 0 0 0 0 0 0 3 1 1 0 0 0 0
# 9  0 0 0 0 0 1 0 0 1 0 1 1 0 0 3 0 0 1 0 2 0 0 0 0 0 0
# 10 1 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 2 0 1 1 1 1 0 0
# 11 2 1 1 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1

Or, possibly more efficient:

# Single table() over every character at once; the position index 1..nchar
# is recycled across all strings, so only one pass over the data is needed.
t(table(do.call(cbind, strsplit(seqs, "", fixed = TRUE)), rep(1:nchar(seqs[1]), length(seqs))))

Upvotes: 4

989
989

Reputation: 12935

You could go for match in base R:

# Build a one-hot matrix per string: row = position, column = letter index
# obtained from match() against LETTERS (alphabet size 26 hard-coded here).
l <- lapply(seqs, function(x) {
    m <- matrix(0, nchar(x), 26)
    replace(m, cbind(seq(nchar(x)),  match(strsplit(x, "")[[1]], LETTERS)), 1)
})

# Element-wise sum of the one-hot matrices reproduces the OP's testR exactly.
all(Reduce("+",l)==testR)
#[1] TRUE

BENCHMARKING (I did not include @akrun's answers as I do not want to install the required packages)

library(microbenchmark)
# Benchmark data: the question's pet example (10 strings of 11 characters,
# seed fixed for reproducibility).  vapply() pins the return type to
# character; TRUE is written out instead of the reassignable T.
set.seed(42)
seqs <- vapply(1:10, FUN = function(x) {
  paste(sample(LETTERS, size = 11, replace = TRUE), collapse = "")
}, FUN.VALUE = character(1))

# The OP's grepl-based baseline, timed under the name fOP.
# Reads `seqs` from the enclosing environment; returns a position x 26 matrix.
fOP <- function() {
  one_hot <- function(s) {
    chars <- unlist(strsplit(s, split = ""))
    do.call(cbind, lapply(LETTERS, function(ch) grepl(ch, chars)))
  }
  Reduce("+", lapply(seqs, FUN = one_hot))
}

# match()-based one-hot construction, benchmarked as f989.
# Reads `seqs` from the enclosing environment; returns a position x 26 matrix.
f989 <- function() {
    l <- lapply(seqs, function(x) {
    # length(LETTERS) rather than a magic 26; seq_len() is the safe form of
    # 1..n (seq(n) misbehaves when n == 0).
    m <- matrix(0, nchar(x), length(LETTERS))
    replace(m, cbind(seq_len(nchar(x)), match(strsplit(x, "")[[1]], LETTERS)), 1)
    })
    Reduce("+", l)
}

# docendo's first solution.  The position vector is derived from the data
# instead of the hard-coded 1:11 so other string lengths also work, and it
# is hoisted out of the lapply() so it is built only once.
fdocendo.discimus <- function() {
  positions <- seq_len(nchar(seqs[1]))
  t(Reduce("+", lapply(strsplit(seqs, "", fixed = TRUE), function(xx)
    table(factor(xx, levels = LETTERS), positions))))
}

# docendo's second solution.  1:11 and 10 are replaced by values computed
# from `seqs` so the function is not tied to this exact sample.
# NOTE: columns cover only the letters actually observed in the data.
fdocendo.discimus1 <- function() {
  t(table(do.call(cbind, strsplit(seqs, "", fixed = TRUE)),
          rep(seq_len(nchar(seqs[1])), length(seqs))))
}

# Cross-check: every candidate must reproduce the OP's reference result.
r <- fOP()
all(r==f989())
# [1] TRUE
all(r==fdocendo.discimus())
# [1] TRUE
all(r==fdocendo.discimus1())
# [1] TRUE
# Time all four candidates (microbenchmark defaults to 100 runs each).
res <- microbenchmark(fOP(), f989(), fdocendo.discimus(), fdocendo.discimus1())

print(res, order="mean")
# Unit: microseconds
                 # expr      min        lq      mean    median       uq      max neval
               # f989()  135.813  150.8360  205.3294  154.1415  159.700 4968.565   100
 # fdocendo.discimus1()  391.813  405.1845  447.6911  418.2545  445.146 2418.480   100
  # fdocendo.discimus()  943.775  990.9495 1090.9905 1015.5880 1062.311 3996.245   100
                # fOP() 1486.725 1521.4280 1643.1604 1548.9215 1602.104 5782.838   100

Upvotes: 3

akrun
akrun

Reputation: 887431

We can also use table once

# Loads purrr (transpose, map) and dplyr (select) used in the pipeline below.
library(tidyverse)
# Split each string into characters, transpose so each list element holds
# all characters seen at one position, stack into a long data frame, and
# cross-tabulate position (ind) against letter (values) with one table().
strsplit(seqs, "") %>% 
       transpose %>% 
       map(unlist) %>% 
       setNames(seq_len(nchar(seqs[1]))) %>% 
       stack %>%
       select(2:1) %>% 
       table
#   values
#ind  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
#  1  0 0 1 0 0 0 0 1 0 1 0 1 0 0 0 0 1 1 1 1 0 0 0 1 0 1
#  2  1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 4 0
#  3  1 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0
#  4  2 0 0 0 0 1 2 0 0 0 0 0 1 1 0 1 1 0 0 0 0 1 0 0 0 0
#  5  0 1 0 0 1 2 0 0 0 0 2 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1
#  6  0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 2 2 1
#  7  0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 1 2
#  8  0 0 0 1 0 0 0 0 0 2 0 0 2 0 0 0 0 0 0 3 1 1 0 0 0 0
#  9  0 0 0 0 0 1 0 0 1 0 1 1 0 0 3 0 0 1 0 2 0 0 0 0 0 0
#  10 1 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 2 0 1 1 1 1 0 0
#  11 2 1 1 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1

Or slightly more compact is by using mtabulate from qdapTools

library(qdapTools)
# mtabulate() counts term occurrences per list element, so after transposing
# the split strings each row is a position and each column a letter.
strsplit(seqs, "") %>% 
         transpose %>% 
         map(unlist) %>%
         mtabulate
#   A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
#1  0 0 1 0 0 0 0 1 0 1 0 1 0 0 0 0 1 1 1 1 0 0 0 1 0 1
#2  1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 4 0
#3  1 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0
#4  2 0 0 0 0 1 2 0 0 0 0 0 1 1 0 1 1 0 0 0 0 1 0 0 0 0
#5  0 1 0 0 1 2 0 0 0 0 2 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1
#6  0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 2 2 1
#7  0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 1 2
#8  0 0 0 1 0 0 0 0 0 2 0 0 2 0 0 0 0 0 0 3 1 1 0 0 0 0
#9  0 0 0 0 0 1 0 0 1 0 1 1 0 0 3 0 0 1 0 2 0 0 0 0 0 0
#10 1 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 2 0 1 1 1 1 0 0
#11 2 1 1 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1

Upvotes: 3

Related Questions