Reputation: 63
From a given string " GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT", I want a matrix that counts triplets in substrings that
For my problem n=10; So I have below code and result:
library( stringr )
#sample data
dna <- c("GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT")
#set constants
start <- c("AAA", "GAA")
end <- "AGT"
n <- 10 # << set as desired
#build regex
regex <- paste0( "(", paste0( start, collapse = "|" ), ")", paste0( "([A-Z]{3}){2,", n, "}" ), end )
str_extract_all( dna, regex )
Result:
[[1]]
[1] "GAACCCACTAGTATAAAATTTGGGAGT" "AAACCCTTTGGGAGT"
Now how to modify the code so that it returns a matrix containing the starting position of each match along with the actual number of "triplet(every 3 characters will be considered as a triplet)" between the start and the end for each match.
Like for the above results, the result should be:
1 7
31 3
here 1 is the starting position for GAACCCACTAGTATAAAATTTGGGAGT and 7 is the number of triplet between starting pattern GAA and ending pattern AGT. Same for AAACCCTTTGGGAGT" 31 is the starting position of AAA and 7 is the number of triplets between AAA and AGT
Upvotes: 1
Views: 73
Reputation: 160447
Here's a function that might give you what I think you need, though it doesn't provide the matrix you said you wanted.
func <- function(dna, n = 10, starts, stops) {
if (length(n) == 1L) n <- c(2L, n)
startptn <- paste0("(", paste(starts, collapse = "|"), ")")
stopptn <- paste0("(", paste(stops, collapse = "|"), ")")
starts_ind <- gregexpr(startptn, dna)
stops_ind <- gregexpr(stopptn, dna)
# stops_ind ends on the first char of the triplet, so add 2
stops_ind <- lapply(stops_ind, `+`, 2L)
candidates <- Map(function(bgn, end, txt) {
mtx <- outer(bgn, end, FUN = function(b, e) substring(txt, b, e))
vec <- mtx[nzchar(mtx)]
vec
}, starts_ind, stops_ind, dna)
# "6L" is the first/last triplet defined in starts/stops
cand_triplets <- lapply(candidates, function(z) nchar(z) - 6L)
lens <- lengths(candidates)
df <- data.frame( id = rep(seq_along(dna), lens), dna = rep(dna, lengths(candidates)) )
df$match <- unlist(candidates)
df$inner <- substring(df$match, 4, nchar(df$match) - 3)
df$ntriplets <- nchar(df$inner) / 3
if (nrow(df) > 0) {
df <- df[ abs(df$ntriplets %% 1) < 1e-5 &
n[1] <= df$ntriplets &
df$ntriplets <= n[2], , drop = FALSE ]
}
df
}
Demo:
dna <- c("GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT")
func(c(dna, dna), 10, c("AAA","GAA"), "AGT")
# id dna match inner ntriplets
# 1 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGT CCCACT 2
# 2 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGTATAAAATTTGGGAGT CCCACTAGTATAAAATTTGGG 7
# 6 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT AAACCCTTTGGGAGT CCCTTTGGG 3
# 7 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGT CCCACT 2
# 8 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGTATAAAATTTGGGAGT CCCACTAGTATAAAATTTGGG 7
# 12 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT AAACCCTTTGGGAGT CCCTTTGGG 3
func(c(dna, dna), 20, c("AAA","GAA"), "AGT")
# id dna match inner ntriplets
# 1 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGT CCCACT 2
# 2 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGTATAAAATTTGGGAGT CCCACTAGTATAAAATTTGGG 7
# 4 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT CCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGG 13
# 6 1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT AAACCCTTTGGGAGT CCCTTTGGG 3
# 7 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGT CCCACT 2
# 8 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGTATAAAATTTGGGAGT CCCACTAGTATAAAATTTGGG 7
# 10 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT CCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGG 13
# 12 2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT AAACCCTTTGGGAGT CCCTTTGGG 3
In the input:
dna
can produce 0 or more rows in the output; I use c(dna, dna)
merely to demonstrate that this works on multiple dna strings, vectorized;n
is length 1, then it defaults to ranging between 2
and n
; if length 2, then both are used appropriately;In the output:
id
numbers each of the dna
input strings;dna
is the actual string; I include both id
and dna
in case a string is accidentally repeated (minor);match
is the matching substring, including the start and stop triplets;inner
is the same substring with the start/stop triplets removed;ntriplets
is really just the nchar
of $inner
;/3
reduces it, and %% 1
should be 0), and then how many triplets we have based on the input n
(If you want to see all matches, set n
to Inf
, though it'll still filter out non-triplet inner
strings.)
As to your request of a matrix of lengths, if we insert a browser()
in the Map
and form mtx
, we'll see that we are using matrices:
bgn
# [1] 1 15 31
# attr(,"match.length")
# [1] 3 3 3
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE
end
# [1] 12 27 45
# attr(,"match.length")
# [1] 3 3 3
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE
txt
# [1] "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT"
mtx <- outer(bgn, end, FUN = function(b, e) substring(txt, b, e))
mtx
# [,1] [,2] [,3]
# [1,] "GAACCCACTAGT" "GAACCCACTAGTATAAAATTTGGGAGT" "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT"
# [2,] "" "AAAATTTGGGAGT" "AAAATTTGGGAGTCCCAAACCCTTTGGGAGT"
# [3,] "" "" "AAACCCTTTGGGAGT"
nchar(mtx) - 6
# [,1] [,2] [,3]
# [1,] 6 21 39
# [2,] -6 7 25
# [3,] -6 -6 9
(The negatives are just an artifact of the debugging environment and my naïve subtraction of 6 with empty strings present; this does not reflect a bug in the function.)
To me, this matrix suggests we have 2, 7, and 13 triplets in the top row.
Upvotes: 1
Reputation: 1
i have an ugly solution. You just need to analysis your pattern.
library( stringr )
#sample data
dna <- c("GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT")
#set constants
start <- c("AAA", "GAA")
end <- "AGT"
n <- 10 # << set as desired
#build regex
regex <- paste0( "(", paste0( start, collapse = "|" ), ")", paste0( "([A-Z]{3}){2,", n, "}" ), end )
triplet_string <- str_extract_all( dna, regex )
triplet_matrix <- str_locate_all(dna, regex)
# first solution:
# differenz between end and start position (-1) divide by 3 for triplet and substract the first and last triplet
triplet_length_1 <- (triplet_matrix[[1]][,2] - (triplet_matrix[[1]][,1]-1))/3 - 2
df <- data.frame(startpos = triplet_matrix[[1]][,1],
lengthtriplet = triplet_length_1)
# second solution:
# getting the length of each ORF and substract 2 for start and stopp codon
triplet_length_2 <- nchar(triplet_string[[1]])/3 - 2
df <- data.frame(startpos = triplet_matrix[[1]][,1],
lengthtriplet = triplet_length_2)
I hope that i could help you.
Upvotes: 0