Find the sequence using R

Question

How can I write a function that accepts a DNA sequence (as a single string) and a number n >= 2, and returns a vector of all DNA subsequences (as strings) that start with the triplet "AAA" or "GAA", end with the triplet "AGT", and have at least 2 and at most n other triplets between the start and the end?

Q1:

For "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT" and n = 2,
the answer is c("GAACCCACTAGT", "AAATTTGGGAGT").

Q2:

For the same sequence and n = 10,
the answer is c("GAACCCACTAGTATAAAATTTGGGAGT", "AAACCCTTTGGGAGT").

Wimpel · Accepted Answer

Here is a possible approach.

It uses a regex with 2 to n repetitions of three [A-Z] characters (one triplet each) at its core, placed between the allowed start triplets and the end triplet.

library(stringr)

# sample data
dna <- "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT"

# set constants
start <- c("AAA", "GAA")
end <- "AGT"
n <- 10  # << set as desired

# build regex: (start triplets)(2 to n arbitrary triplets)(end triplet)
regex <- paste0("(", paste0(start, collapse = "|"), ")", "([A-Z]{3}){2,", n, "}", end)
# for n = 10, this looks like: "(AAA|GAA)([A-Z]{3}){2,10}AGT"

stringr::str_extract_all(dna, regex)

# n = 2
# [[1]]
# [1] "GAACCCACTAGT" "AAATTTGGGAGT"

# n = 10
# [[1]]
# [1] "GAACCCACTAGTATAAAATTTGGGAGT" "AAACCCTTTGGGAGT"
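
Since the question asks for a function, the same regex could be wrapped roughly as follows. This is only a minimal sketch: it assumes base R's regmatches() and gregexpr() in place of stringr, and the name find_subsequences and its default arguments are made up for illustration.

```r
# Hypothetical helper: the name and the defaults are illustrative assumptions.
find_subsequences <- function(dna, n, start = c("AAA", "GAA"), end = "AGT") {
  # expect one DNA string and n >= 2, as stated in the question
  stopifnot(is.character(dna), length(dna) == 1, n >= 2)

  # same pattern as above: (start triplets)(2 to n arbitrary triplets)(end triplet)
  regex <- paste0("(", paste(start, collapse = "|"), ")",
                  "([A-Z]{3}){2,", as.integer(n), "}", end)

  # gregexpr() locates all non-overlapping matches, regmatches() extracts them;
  # [[1]] returns a plain character vector (character(0) if nothing matches)
  regmatches(dna, gregexpr(regex, dna, perl = TRUE))[[1]]
}

dna <- "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT"
find_subsequences(dna, 2)
# [1] "GAACCCACTAGT" "AAATTTGGGAGT"
find_subsequences(dna, 10)
# [1] "GAACCCACTAGTATAAAATTTGGGAGT" "AAACCCTTTGGGAGT"
```

Note that the matches are greedy and non-overlapping, which is why the first n = 10 result spans both of the n = 2 results instead of listing them separately.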
