Reputation: 21
If I want to find two different patterns in a single sequence, how should I do it? For example:
seq="ATGCAAAGGT"
The patterns are:
pattern=c("ATGC","AAGG")
How am I supposed to find these two patterns simultaneously in the sequence?
I also want to find the location of each pattern, for example as start,end pairs such as 1,4 and 6,9.
Can anyone help me with this?
Upvotes: 0
Views: 322
Reputation: 2666
Let's say your sequence file is just a vector of sequences:
seq.file <- c('ATGCAAAGGT','ATGCTAAGGT','NOTINTHISONE')
You can search for both motifs and return a logical (TRUE/FALSE) vector that indicates whether both are present, using the following one-liner:
grepl('ATGC', seq.file) & grepl('AAGG', seq.file)
[1] TRUE TRUE FALSE
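If the number of motifs may grow, chaining grepl calls with & by hand gets awkward. A minimal sketch of a more general version, assuming the motifs are literal strings (hence fixed = TRUE) and using the illustrative names patterns, hits and all.present, which are not from the original post:

```r
# Example data mirroring the vectors used above
seq.file <- c("ATGCAAAGGT", "ATGCTAAGGT", "NOTINTHISONE")
patterns <- c("ATGC", "AAGG")

# One logical vector per pattern; fixed = TRUE treats each pattern as a
# literal string rather than a regular expression
hits <- lapply(patterns, function(p) grepl(p, seq.file, fixed = TRUE))

# Combine with a pairwise AND: TRUE only where every pattern was found
all.present <- Reduce(`&`, hits)
print(all.present)
```

This should give the same logical vector as the one-liner, and adding a third motif only means extending patterns.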
Let's say the vector of sequences is a column within a data frame d, which also contains a column of ID values:
id <- c('s1','s2','s3')
d <- data.frame(id,seq.file)
colnames(d) <- c('id','sequence')
You can append a column to this data frame, d, that identifies whether a given sequence contains both motifs with this one-liner:
d$match <- grepl('ATGC',d$sequence) & grepl('AAGG', d$sequence)
> print(d)
  id     sequence match
1 s1   ATGCAAAGGT  TRUE
2 s2   ATGCTAAGGT  TRUE
3 s3 NOTINTHISONE FALSE
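The following for-loop prints the start and end positions of each of the patterns within each sequence:
library(stringr)
# The two patterns from the question; the loop below relies on this vector
pattern <- c("ATGC", "AAGG")
for (i in seq_along(d$sequence)) {
  # One start/end matrix per pattern
  out <- str_locate_all(d$sequence[i], pattern)
  # Flattening assumes at most one match per pattern in each sequence
  first <- c(out[[1]])
  first.o <- paste(first[1], first[2], sep = ',')
  second <- c(out[[2]])
  second.o <- paste(second[1], second[2], sep = ',')
  print(c(first.o, second.o))
}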
[1] "1,4" "6,9"
[1] "1,4" "6,9"
[1] "NA,NA" "NA,NA"
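If you would rather not depend on stringr, the same kind of start,end output can be produced with base R's gregexpr. This is only a sketch: locate_pattern is a hypothetical helper written for this example, and it assumes the patterns are literal strings (fixed = TRUE).

```r
seq.file <- c("ATGCAAAGGT", "ATGCTAAGGT", "NOTINTHISONE")
patterns <- c("ATGC", "AAGG")

# Hypothetical helper: returns "start,end" for every match of p in x,
# or "NA,NA" when p does not occur in x
locate_pattern <- function(p, x) {
  m <- gregexpr(p, x, fixed = TRUE)[[1]]  # start positions, -1 if no match
  if (m[1] == -1) return("NA,NA")
  paste(m, m + attr(m, "match.length") - 1, sep = ",")
}

# One character vector of positions per sequence
res <- lapply(seq.file, function(s) unlist(lapply(patterns, locate_pattern, x = s)))
for (r in res) print(r)
```

Unlike the loop above, this version keeps every match when a pattern occurs more than once in a sequence.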
Upvotes: 3
Reputation: 12937
How about using stringr to find the start and end positions:
library(stringr)
seq <- "ATGCAAAGGT"
pattern <- c("ATGC","AAGG")
str_locate_all(seq, pattern)
#[[1]]
#     start end
#[1,]     1   4
#
#[[2]]
#     start end
#[1,]     6   9
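The list returned by str_locate_all has one matrix per pattern, in the same order as the pattern vector. If a single table is easier to work with, one possible way to stack it is sketched below; the names locs and positions are just illustrative.

```r
library(stringr)

seq <- "ATGCAAAGGT"
pattern <- c("ATGC", "AAGG")

locs <- str_locate_all(seq, pattern)
names(locs) <- pattern  # label each matrix with the pattern it belongs to

# Stack the per-pattern matrices into one data frame;
# a pattern with no match contributes zero rows
positions <- do.call(rbind, lapply(names(locs), function(p) {
  m <- locs[[p]]
  data.frame(pattern = rep(p, nrow(m)), start = m[, "start"], end = m[, "end"])
}))
print(positions)
```

Each row then holds one match, so repeated occurrences of a pattern simply show up as extra rows.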
Upvotes: 1
Reputation: 10483
You can try using the stringr library to do something like this:
seq = "ATGCAAAGGT"
library(stringr)
str_extract_all(seq, 'ATGC|AAGG')
[[1]]
[1] "ATGC" "AAGG"
Without knowing more specifically what output you are looking for, this is the best I can provide right now.
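If the positions are needed as well, the same alternation regex can be passed to str_locate_all and paired with the extracted text. A rough sketch, with regex, matched, where and result as illustrative names:

```r
library(stringr)

seq <- "ATGCAAAGGT"
regex <- "ATGC|AAGG"

matched <- str_extract_all(seq, regex)[[1]]  # the text of each match
where <- str_locate_all(seq, regex)[[1]]     # start/end of each match, same order

result <- data.frame(match = matched, start = where[, "start"], end = where[, "end"])
print(result)
```

One caveat: a single alternation regex is scanned left to right and returns non-overlapping matches only, so if two patterns can overlap in the sequence, searching for each pattern separately is safer.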
Upvotes: 2