Deal with the output of `str_locate_all` in tibble

Question

I am looking for motifs in character sequences. Once I find them, I would like to plot the positions of the motifs along the sequence (x: position in the sequence, whose length is always 100; y: number of motifs at that position).

My idea is to create a table with 100 columns (the length of the sequence), filled with e.g. "1" at every position covered by a motif.

I start with a table that holds the ID of each sequence (id) and the sequence itself (seq):

library(tidyverse)  # provides tibble, dplyr (%>%, mutate), stringr and purrr

table <- tibble(id = c("AT1", "AT2", "AT3"),
                seq = c("AAGCCCATTTAGGGTTTTTTTTAAGCCCAGACCCGGACTCTAATTGCTCCGTATTCTTCTTCTCTTGAGAGGGTTTAAGAGAGAGTTTTTTTGAGAGC",
                        "AACTTGGCCCAAAAAAAAGCCCATTTAGGGTTAAAACAGTAGCAAAAAAACGGACTCTAATTGCTCCGTATTCTTTAGGGTTTGAGAGATTTTTTTAA",
                        "GTTTTTTTAGGGTTTAGTTAAAAAAATAGCAGGGTTTAGGACTCTAATTTAGGGTTATTCTTCTTCTCTTGAGAGAGATTTTTTTAGGGTAGAGCTAGCA"))

Then I look for the declared motifs (AAAAAAA and TTTTTTT) using str_locate_all, because I want the positions of all occurrences:

table2 <- table %>% 
  mutate(AAAAAAA = str_locate_all(seq, "AAAAAAA")) %>%
  mutate(TTTTTTT = str_locate_all(seq, "TTTTTTT"))

Unfortunately, this gives only the start and end position of each motif, stored as list columns:

# A tibble: 3 x 4
  id    seq                                                                                                  AAAAAAA           TTTTTTT
  <chr> <chr>                                                                                                <list>            <list>
1 AT1   AAGCCCATTTAGGGTTTTTTTTAAGCCCAGACCCGGACTCTAATTGCTCCGTATTCTTCTTCTCTTGAGAGGGTTTAAGAGAGAGTTTTTTTGAGAGC   <int[,2] [0 x 2]> <int[,2] [2 x 2]>
2 AT2   AACTTGGCCCAAAAAAAAGCCCATTTAGGGTTAAAACAGTAGCAAAAAAACGGACTCTAATTGCTCCGTATTCTTTAGGGTTTGAGAGATTTTTTTAA   <int[,2] [2 x 2]> <int[,2] [1 x 2]>
3 AT3   GTTTTTTTAGGGTTTAGTTAAAAAAATAGCAGGGTTTAGGACTCTAATTTAGGGTTATTCTTCTTCTCTTGAGAGAGATTTTTTTAGGGTAGAGCTAGCA <int[,2] [1 x 2]> <int[,2] [2 x 2]>

This is where I got stuck, as I do not know how to extract the values from the motif columns. I am thinking of producing output like the following (splitting each motif column into a separate matrix and then filling in the positions between start and end):

AAAAAAA <- matrix(nrow = 3, ncol = 100)
row.names(AAAAAAA) <- c("AT1", "AT2", "AT3")
AAAAAAA[2, c(11:17, 44:50)] <- 1
AAAAAAA[3, c(20:26)] <- 1

    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33]
AT1   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
AT2   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA     1     1     1     1     1     1     1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
AT3   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA     1     1     1     1     1     1     1    NA    NA    NA    NA    NA    NA    NA
    [,34] [,35] [,36] [,37] [,38] [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47] [,48] [,49] [,50] [,51] [,52] [,53] [,54] [,55] [,56] [,57] [,58] [,59] [,60] [,61] [,62] [,63] [,64] [,65]
AT1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
AT2    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA     1     1     1     1     1     1     1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
AT3    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
    [,66] [,67] [,68] [,69] [,70] [,71] [,72] [,73] [,74] [,75] [,76] [,77] [,78] [,79] [,80] [,81] [,82] [,83] [,84] [,85] [,86] [,87] [,88] [,89] [,90] [,91] [,92] [,93] [,94] [,95] [,96] [,97]
AT1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
AT2    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
AT3    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
    [,98] [,99] [,100]
AT1    NA    NA     NA
AT2    NA    NA     NA
AT3    NA    NA     NA

Then I plan to join all these tables back into one big table and make a plot using ggplot.

I have described the whole problem because I am not sure whether my approach is a good one. I would appreciate any hints on how to "extract" or "simplify" the output of str_locate_all, or any general tips.

EDIT: My expected final output is a table that lets me plot the positions of all motifs over the sequence length (to see whether motifs cluster e.g. at the beginning of the sequence).

Cettt · Accepted Answer

Here is one possibility:

You can write a function (I called it make_vec) that takes a sequence and a pattern as input and returns the corresponding row of the matrix.

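make_vec <- function(seq, pattern) {
  # Start/end matrix of all matches in this single sequence
  loc <- str_locate_all(seq, pattern)[[1]]
  # One slot per position; assumes sequences are at most 100 characters long
  res <- integer(100)
  # Expand each start position into every position covered by the motif
  idx <- as.numeric(sapply(loc[,1], function(x) x + seq_len(nchar(pattern)) - 1))
  res[idx] <- 1L
  return(res)
}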
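
To see what the function has to work with, it may help to look at a single string first. This is only a sketch, assuming stringr is installed; the string `toy` is made up for illustration and is not one of the sequences from the question. str_locate_all returns a list with one start/end matrix per input string, and each row of that matrix can be expanded into the full range of covered positions:

```r
library(stringr)

# Hypothetical toy sequence, used only for illustration
toy <- "CCAAAAAAAGGAAAAAAAT"

# str_locate_all() returns a list with one matrix per input string;
# each row of the matrix holds the start and end of one match
loc <- str_locate_all(toy, "AAAAAAA")[[1]]
print(loc)

# Expand every start/end pair into all positions covered by that match
covered <- unlist(Map(seq, loc[, "start"], loc[, "end"]))
print(covered)
```

Using both the start and the end column is an alternative to adding nchar(pattern) to the start; it might be preferable if the pattern is a regular expression whose matches vary in length.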

Next, you can apply this function to every sequence and extract the results:

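# map() comes from purrr; it calls make_vec once per sequence
table3 <- table %>% 
  mutate(
    a = map(seq, make_vec, pattern = "AAAAAAA"),
    t = map(seq, make_vec, pattern = "TTTTTTT")
  )

# One row per sequence, one column per position
matrix(unlist(table3$a), nrow = 3, byrow = TRUE)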
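
To get from such a matrix to the plot described in the question, one option is to sum each column, which gives the number of sequences with a motif at that position. The following is a rough sketch, assuming ggplot2 is installed. To stay self-contained it rebuilds the AAAAAAA matrix from the positions listed in the question (11:17 and 44:50 in AT2, 20:26 in AT3), using 0 instead of NA; the names `motif_mat`, `counts` and `p` are made up for illustration.

```r
library(ggplot2)

# 0/1 matrix: one row per sequence, one column per position (1 to 100)
motif_mat <- matrix(0L, nrow = 3, ncol = 100,
                    dimnames = list(c("AT1", "AT2", "AT3"), NULL))
motif_mat["AT2", c(11:17, 44:50)] <- 1L
motif_mat["AT3", 20:26] <- 1L

# Column sums give the number of sequences with the motif at each position
counts <- data.frame(position = seq_len(ncol(motif_mat)),
                     n = colSums(motif_mat))

# Bar chart of motif counts along the sequence
p <- ggplot(counts, aes(x = position, y = n)) +
  geom_col() +
  labs(x = "Position in sequence", y = "Number of motifs")
print(p)
```

With the table3 from above, the same kind of matrix comes from matrix(unlist(table3$a), nrow = 3, byrow = TRUE), so the colSums step should apply to it unchanged.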
