shrinirajesh

Reputation: 21

Locate different patterns in a sequence

How can I find two different patterns in a single sequence? For example:

seq="ATGCAAAGGT"

the patterns are

pattern=c("ATGC","AAGG")

How am I supposed to find these two patterns simultaneously in the sequence?

I also want to find the locations of these patterns; for example, the pattern locations are 1,4 and 5,8.

Can anyone help me with this?

Upvotes: 0

Views: 322

Answers (3)

colin

Reputation: 2666

Let's say your sequence file is just a vector of sequences:

seq.file <- c('ATGCAAAGGT','ATGCTAAGGT','NOTINTHISONE')

You can search for both motifs and return a TRUE/FALSE vector that indicates, for each sequence, whether both are present, using the following one-liner:

grepl('ATGC', seq.file) & grepl('AAGG', seq.file)
[1]  TRUE  TRUE FALSE
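
If there are more than two motifs, chaining `grepl` calls by hand gets tedious. A minimal sketch of one way to generalize the one-liner, assuming the motifs are literal strings (hence `fixed = TRUE`); the names `patterns` and `all.present` are only illustrative:

```r
seq.file <- c('ATGCAAAGGT', 'ATGCTAAGGT', 'NOTINTHISONE')
patterns <- c('ATGC', 'AAGG')

# One logical vector per pattern, each as long as seq.file
hits <- lapply(patterns, grepl, x = seq.file, fixed = TRUE)

# Combine them element-wise with `&`: TRUE only where every pattern is present
all.present <- Reduce(`&`, hits)
print(all.present)
```

For the three sequences above this should give the same result as the two-pattern one-liner.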

Let's say the vector of sequences is a column within a data frame d, which also contains a column of ID values:

id <- c('s1','s2','s3')
d <- data.frame(id,seq.file)
colnames(d) <- c('id','sequence')

You can append a column to this data frame, d, that flags whether a given sequence contains both motifs, with this one-liner:

d$match <- grepl('ATGC',d$sequence) & grepl('AAGG', d$sequence)
print(d)
  id     sequence match
1 s1   ATGCAAAGGT  TRUE
2 s2   ATGCTAAGGT  TRUE
3 s3 NOTINTHISONE FALSE

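The following for-loop prints the start and end positions of each pattern within each sequence:

library(stringr)

# Patterns from the question
pattern <- c("ATGC", "AAGG")

for (i in seq_along(d$sequence)) {
    # One start/end matrix per pattern; as.character() guards against a factor column
    out <- str_locate_all(as.character(d$sequence[i]), pattern)
    # Note: flattening with c() assumes each pattern occurs at most once per sequence
    first    <- c(out[[1]])
    first.o  <- paste(first[1], first[2], sep = ',')
    second   <- c(out[[2]])
    second.o <- paste(second[1], second[2], sep = ',')
    print(c(first.o, second.o))
}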
[1] "1,4" "6,9"
[1] "1,4" "6,9"
[1] "NA,NA" "NA,NA"

Upvotes: 3

989

Reputation: 12937

How about this using stringr to find start and end positions:

library(stringr)
seq <- "ATGCAAAGGT"
pattern <- c("ATGC","AAGG")
str_locate_all(seq, pattern)

#[[1]]
#     start end
#[1,]     1   4
#
#[[2]]
#     start end
#[1,]     6   9
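
If installing stringr is not an option, roughly the same positions can be obtained in base R. This is a sketch rather than a drop-in replacement: `locate_fixed` and `positions` are hypothetical names, the patterns are treated as literal strings, and `gregexpr` only reports non-overlapping occurrences of each pattern.

```r
seq <- "ATGCAAAGGT"
pattern <- c("ATGC", "AAGG")

# Return a data frame of start/end positions for one literal pattern p in string s
locate_fixed <- function(p, s) {
  m <- gregexpr(p, s, fixed = TRUE)[[1]]
  if (m[1] == -1) {
    # gregexpr signals "no match" with -1; return an empty table in that case
    return(data.frame(pattern = character(0), start = integer(0),
                      end = integer(0), stringsAsFactors = FALSE))
  }
  start <- as.integer(m)
  data.frame(pattern = p, start = start,
             end = start + attr(m, "match.length") - 1L,
             stringsAsFactors = FALSE)
}

# Stack the per-pattern tables into one data frame
positions <- do.call(rbind, lapply(pattern, locate_fixed, s = seq))
print(positions)
```

Collecting the results in a single data frame makes it easy to filter or merge them with other per-sequence information later.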

Upvotes: 1

Gopala

Reputation: 10483

You can try using the stringr library to do something like this:

seq = "ATGCAAAGGT"
library(stringr)
str_extract_all(seq, 'ATGC|AAGG')
[[1]]
[1] "ATGC" "AAGG"

Without knowing more specifically what output you are looking for, this is the best I can provide right now.
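
One caveat with a combined `'A|B'` pattern: regular-expression matching resumes after the end of each match, so patterns that overlap in the sequence are not all reported. The sketch below uses base R (`gregexpr` and `regmatches`) instead of stringr, and the second pattern `"GCAA"` is a made-up example chosen because it overlaps `"ATGC"`; it is not from the question.

```r
seq <- "ATGCAAAGGT"

# Combined search: once "ATGC" matches at 1-4, the scan resumes at position 5,
# so "GCAA" (which starts at position 3) is never seen.
combined <- regmatches(seq, gregexpr("ATGC|GCAA", seq))[[1]]

# Separate searches: each pattern is scanned independently over the whole string.
separate <- sapply(c("ATGC", "GCAA"),
                   function(p) gregexpr(p, seq, fixed = TRUE)[[1]][1])

print(combined)
print(separate)
```

If the patterns may overlap, searching for each one separately is the safer choice.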

Upvotes: 2
