user1941884

Reputation: 47

Sequence recognition, counting of occurrences and retrieving part of sequence

I don't think what I want to do is really hard, but I lack the R knowledge for this kind of thing. So help is truly appreciated!

I have a file containing protein names and sequences, so something like this:

Protein1 ABCDEFGHIJKLMNOPQRSTUWXYZ
Protein2 ABCDEFGHIJKUVMNOPQRSTUVWXYZ
Protein3 ABCUVDEFGHIJKLMNOPQRSTVVW

I'm looking for proteins that contain the pattern 'UU', 'UV' or 'VV'. I did that using:

(Edit: this is a simplified example, currently I'm looking at triplets ("[UV][UV][UV]"))

y <- x[grep("[UV][UV]", x[,2]),]

So now I know which ones do have the pattern, but I want more. First of all, I want to know how often this pattern is present in the sequence, but I couldn't find out how to do this so far. So that's question number 1.

Question number 2: I want to extract the pattern + part of the sequence in front. So far I've used:

pattern <- "[A-Z]{5}[UV][UV]"
locs <- regexpr(pattern, y[,2])
z <- substr(y[,2], locs, locs+attr(locs,"match.length")-1)

This does work, but only for the first occurrence of the pattern in each sequence; it doesn't return all the places where the pattern occurs.

What I would like to end up with is something containing this information:

Protein name,
number of patterns found in the sequence,
pattern + part of the desired sequence in front

In my example the results will be something like this:

Protein1
0

Protein2
2
GHIJKUV
PQRSTUV

Protein3 
2
ABCUV  #don't know about this one, since the sequence in front is shorter than 5. For me it would be best if these would not appear.
PQRSTVV

Edit: In the end I would like to have a data matrix that I can save to a text file and share with others. Preferably I would end up with something like this:

ProteinName Count Sequence1 Sequence2 Sequence3 SequenceMax
Protein1    0 
Protein2    2     GHIJKUV   PQRSTUV

Upvotes: 3

Views: 249

Answers (2)

agstudy

Reputation: 121578

I assume your sequences are in a list

ll <- list('Protein1 ABCDEFGHIJKLMNOPQRSTUWXYZ',
'Protein2 ABCDEFGHIJKUVMNOPQRSTUVWXYZ',
'Protein3 ABCUVDEFGHIJKLMNOPQRSTVVW')

This works:

 sapply(ll, function(x) 
              regmatches(x,gregexpr('[A-Z]{5}UU|[A-Z]{5}UV|[A-Z]{5}VV', x)))


 [[1]]
 character(0)

[[2]]
[1] "GHIJKUV" "PQRSTUV"

[[3]]
[1] "PQRSTVV"
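
For question 1 (how often the pattern occurs), the count can be read off the same result. This is only a minimal sketch: the names `x`, `hits` and `counts` are mine, the three alternatives are folded into one equivalent group, and `lengths()` assumes R 3.2.0 or later.

```r
# Sample data from the question, as a plain character vector
x <- c('Protein1 ABCDEFGHIJKLMNOPQRSTUWXYZ',
       'Protein2 ABCDEFGHIJKUVMNOPQRSTUVWXYZ',
       'Protein3 ABCUVDEFGHIJKLMNOPQRSTVVW')

# Five letters of context followed by UU, UV or VV
pattern <- '[A-Z]{5}(UU|UV|VV)'

# One character vector of matches per input string
hits <- regmatches(x, gregexpr(pattern, x))

# Number of matches per protein
counts <- lengths(hits)
```

Note that this counts only matches that have five letters in front of them, so the `ABCUV` at the start of Protein3 is not included. To count every occurrence regardless of context, drop the `[A-Z]{5}` prefix from the pattern.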

Edit: to match a run of U and V of any length (two or more), in any combination:

pattern <- '[A-Z]{5}(U|V)(V|U)+'    ## five letters, then a U or V,
                                    ## followed by at least one more U or V

For example, I modified your data to include longer patterns:

ll <- list('Protein1 ABCDEFGHIJKLMNOPQRSTUVWXYZ',
           'Protein2 ABCDEFGHIJKUVMNOPQRSTUUVWXYZ',
           'Protein3 ABCUVDEFGHIJUVVKLMNOPQRSTVUUUW')

sapply(ll, function(x)  regmatches(x,gregexpr(pattern, x)))

[[1]]
[1] "PQRSTUV"

[[2]]
[1] "GHIJKUV"  "PQRSTUUV"

[[3]]
[1] "FGHIJUVV"  "PQRSTVUUU"
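
To get from this list to the table asked for in the question's edit, one possible approach is to pad each match vector to the same length and bind the rows together. This is a sketch under assumptions: the column names follow the question's example, and `out_file` is a hypothetical output path (a temporary file here).

```r
x <- c('Protein1 ABCDEFGHIJKLMNOPQRSTUVWXYZ',
       'Protein2 ABCDEFGHIJKUVMNOPQRSTUUVWXYZ',
       'Protein3 ABCUVDEFGHIJUVVKLMNOPQRSTVUUUW')
pattern <- '[A-Z]{5}(U|V)(V|U)+'

hits <- regmatches(x, gregexpr(pattern, x))

# Widest row decides the number of Sequence columns (at least one)
n_max <- max(lengths(hits), 1L)

# Pad every match vector with "" so that all rows have n_max entries
padded <- do.call(rbind, lapply(hits, function(h) {
  c(h, rep("", n_max - length(h)))
}))
colnames(padded) <- paste0("Sequence", seq_len(n_max))

result <- data.frame(ProteinName = sub(" .*", "", x),  # text before the first space
                     Count = lengths(hits),
                     padded,
                     stringsAsFactors = FALSE)

# Tab-separated text file that can be shared with others
out_file <- tempfile(fileext = ".txt")  # hypothetical path, replace as needed
write.table(result, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

Empty strings are used as padding so the file stays rectangular; `NA` would work as well if the readers of the file prefer it.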

Upvotes: 2

IRTFM

Reputation: 263352

For the number of matches (this pattern covers only 'UU' and 'UV'; append '|VV' to count that as well):

> sapply( strsplit(dat[[2]], "UU|UV"), length) -1
[1] 0 2 1
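
One caveat with the `strsplit` trick: R drops a trailing empty piece, so a match at the very end of a sequence is not counted. A hedged alternative is to count the match positions from `gregexpr` directly. In the sketch below, `count_matches` is a hypothetical helper and `dat` is my assumption of how the data frame looks (names in `V1`, sequences in `V2`).

```r
# Assumed layout of the data: names in V1, sequences in V2
dat <- data.frame(V1 = c("Protein1", "Protein2", "Protein3"),
                  V2 = c("ABCDEFGHIJKLMNOPQRSTUWXYZ",
                         "ABCDEFGHIJKUVMNOPQRSTUVWXYZ",
                         "ABCUVDEFGHIJKLMNOPQRSTVVW"),
                  stringsAsFactors = FALSE)

# Count non-overlapping matches of `pattern` in each element of `s`.
# gregexpr() returns -1 for "no match", so only positive positions count.
count_matches <- function(pattern, s) {
  m <- gregexpr(pattern, s)
  vapply(m, function(pos) sum(pos > 0), integer(1))
}

counts <- count_matches("UU|UV|VV", dat$V2)

# A match at the end of the string is counted here ...
end_count <- count_matches("UV", "ABUV")
# ... but is lost by the split-and-count approach
split_count <- length(strsplit("ABUV", "UV")[[1]]) - 1
```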

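To isolate the sequences, note that sub() returns its input unchanged when nothing matches, so any result with the same number of characters as the input is a non-match. Because the leading (.+) is greedy, only the last match in each sequence is returned: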

> sub("(.+)(.{5}UU|.{5}UV)(.+)", "\\2", dat[[2]])
[1] "ABCDEFGHIJKLMNOPQRSTUWXYZ" "PQRSTUV"                   "ABCUVDEFGHIJKLMNOPQRSTVVW"

To bind them together:

> apply(dat, 1, function(x) list(
+     count = sapply(strsplit(x[2], "UU|UV"), length) - 1,
+     matches = {
+         mat <- gsub("(.+)(.{5}UU|.{5}UV)(.+)", "\\2", x[2])
+         if (nchar(mat) != nchar(x[2])) mat else ""
+     }))
[[1]]
[[1]]$count
V2 
 0 

[[1]]$matches
[1] ""


[[2]]
[[2]]$count
V2 
 2 

[[2]]$matches
       V2 
"PQRSTUV" 


[[3]]
[[3]]$count
V2 
 1 

[[3]]$matches
[1] ""

Upvotes: 3
