Reputation: 369
I am looking for motifs in character sequences. Once I find them, I would like to plot the positions of the motifs along the sequence (x: position in the sequence, whose length is always 100; y: number of motifs at that position).
My idea is to create a table with 100 columns (the length of the sequence) that is filled with, e.g., "1" at the positions covered by a motif.
I start with a table that contains the ID of each sequence (id) and the sequence itself (seq):
library(tidyverse)
table <- tibble(id = c("AT1", "AT2", "AT3"),
seq = c("AAGCCCATTTAGGGTTTTTTTTAAGCCCAGACCCGGACTCTAATTGCTCCGTATTCTTCTTCTCTTGAGAGGGTTTAAGAGAGAGTTTTTTTGAGAGC",
"AACTTGGCCCAAAAAAAAGCCCATTTAGGGTTAAAACAGTAGCAAAAAAACGGACTCTAATTGCTCCGTATTCTTTAGGGTTTGAGAGATTTTTTTAA",
"GTTTTTTTAGGGTTTAGTTAAAAAAATAGCAGGGTTTAGGACTCTAATTTAGGGTTATTCTTCTTCTCTTGAGAGAGATTTTTTTAGGGTAGAGCTAGCA"))
Then I look for the declared motifs (AAAAAAA and TTTTTTT) using str_locate_all (I want to find the positions of all motifs):
table2 <- table %>%
  mutate(AAAAAAA = str_locate_all(seq, "AAAAAAA")) %>%
  mutate(TTTTTTT = str_locate_all(seq, "TTTTTTT"))
Unfortunately, this gives only the start and end positions of each motif:
# A tibble: 3 x 4
id seq AAAAAAA TTTTTTT
<chr> <chr> <list> <list>
1 AT1 AAGCCCATTTAGGGTTTTTTTTAAGCCCAGACCCGGACTCTAATTGCTCCGTATTCTTCTTCTCTTGAGAGGGTTTAAGAGAGAGTTTTTTTGAGAGC <int[,2] [0 × 2]> <int[,2] [2 × 2]>
2 AT2 AACTTGGCCCAAAAAAAAGCCCATTTAGGGTTAAAACAGTAGCAAAAAAACGGACTCTAATTGCTCCGTATTCTTTAGGGTTTGAGAGATTTTTTTAA <int[,2] [2 × 2]> <int[,2] [1 × 2]>
3 AT3 GTTTTTTTAGGGTTTAGTTAAAAAAATAGCAGGGTTTAGGACTCTAATTTAGGGTTATTCTTCTTCTCTTGAGAGAGATTTTTTTAGGGTAGAGCTAGCA <int[,2] [1 × 2]> <int[,2] [2 × 2]>
This is where I got stuck, as I do not know how to extract the values from the motif columns. I am thinking of creating output like the following (splitting each motif column into a separate data.frame and then filling in the positions between start and end):
AAAAAAA <- matrix(nrow = 3, ncol = 100)
row.names(AAAAAAA) <- c("AT1", "AT2", "AT3")
AAAAAAA[2, c(11:17, 44:50)] <- 1
AAAAAAA[3, c(20:26)] <- 1
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33]
AT1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
AT2 NA NA NA NA NA NA NA NA NA NA 1 1 1 1 1 1 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
AT3 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 1 1 1 1 1 1 NA NA NA NA NA NA NA
[,34] [,35] [,36] [,37] [,38] [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47] [,48] [,49] [,50] [,51] [,52] [,53] [,54] [,55] [,56] [,57] [,58] [,59] [,60] [,61] [,62] [,63] [,64] [,65]
AT1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
AT2 NA NA NA NA NA NA NA NA NA NA 1 1 1 1 1 1 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
AT3 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[,66] [,67] [,68] [,69] [,70] [,71] [,72] [,73] [,74] [,75] [,76] [,77] [,78] [,79] [,80] [,81] [,82] [,83] [,84] [,85] [,86] [,87] [,88] [,89] [,90] [,91] [,92] [,93] [,94] [,95] [,96] [,97]
AT1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
AT2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
AT3 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[,98] [,99] [,100]
AT1 NA NA NA
AT2 NA NA NA
AT3 NA NA NA
Then I plan to join all the tables back into one big table and make a plot using ggplot.
I have described the whole problem because I am not sure whether my approach is a good one, and I would appreciate any hints on how to "extract" / "simplify" the output of str_locate_all, or general tips.
EDIT: My expected final output is a table that lets me plot the positions of all motifs over the sequence length (to see whether motifs cluster, e.g., at the beginning of the sequence).
Upvotes: 1
Views: 410
Reputation: 11981
Here is one possibility: you can write a function (I called it make_vec) which takes the sequence and a pattern as input and returns the corresponding row of the matrix.
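make_vec <- function(seq, pattern) {
  # Start/end matrix for the single input string
  loc <- str_locate_all(seq, pattern)[[1]]
  # One slot per position; sequences are assumed to be 100 characters long
  res <- integer(100)
  # Expand each start into all positions covered by the motif
  idx <- as.numeric(sapply(loc[, 1], function(x) x + seq_len(nchar(pattern)) - 1))
  res[idx] <- 1L
  return(res)
}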
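For context, str_locate_all returns one integer matrix per input string, with one row per match and the columns start and end. As a possible variation (not part of the function above), the sketch below uses those two columns directly and takes the output length from the sequence itself instead of fixing it at 100. The name make_vec_len and the short test sequence are hypothetical and only serve as an illustration.

```r
library(stringr)

# Hypothetical variant of make_vec(): the output length follows the
# sequence instead of being fixed at 100.
make_vec_len <- function(seq, pattern) {
  # Integer matrix with columns "start" and "end", one row per match
  loc <- str_locate_all(seq, pattern)[[1]]
  res <- integer(nchar(seq))
  # Mark every position between each start and end (inclusive)
  for (i in seq_len(nrow(loc))) {
    res[loc[i, "start"]:loc[i, "end"]] <- 1L
  }
  res
}

# Short made-up sequence: the motif covers positions 3 to 9
make_vec_len("GGAAAAAAATT", "AAAAAAA")
```

If there is no match, the loop body never runs and the function returns a vector of zeros.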
Next you can use this function for every sequence and extract the results:
table3 <- table %>%
  mutate(
    a = map(seq, make_vec, pattern = "AAAAAAA"),
    t = map(seq, make_vec, pattern = "TTTTTTT"),
  )
matrix(unlist(table3$a), nrow = 3, byrow = TRUE)
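To get from such a matrix to the plot described in the question, one option is to sum it column-wise, which gives the number of sequences with a motif at each position. This is a minimal sketch under assumptions: the small matrix m is made up and stands in for the real result, and geom_col is just one way to draw the counts.

```r
library(ggplot2)

# Hypothetical 3 x 10 indicator matrix standing in for the result of
# matrix(unlist(table3$a), ...); 1 = position covered by a motif.
m <- rbind(
  c(0, 1, 1, 1, 0, 0, 0, 0, 0, 0),
  c(0, 0, 1, 1, 1, 0, 0, 0, 0, 0),
  c(0, 0, 0, 0, 0, 0, 1, 1, 1, 0)
)

# Number of sequences with a motif at each position
counts <- data.frame(position = seq_len(ncol(m)), n = colSums(m))

# Bar chart of motif counts along the sequence
p <- ggplot(counts, aes(x = position, y = n)) +
  geom_col() +
  labs(x = "Position in sequence", y = "Number of motifs")
```

Printing p draws the chart. The same steps could be repeated per motif, or the count tables could be stacked with an extra motif column and mapped to fill or facets.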
Upvotes: 2