Reputation: 47
I don't think what I want to do is really hard, but I lack the R knowledge for this kind of thing, so any help is truly appreciated!
I have a file containing protein names and sequences, so something like this:
Protein1 ABCDEFGHIJKLMNOPQRSTUWXYZ
Protein2 ABCDEFGHIJKUVMNOPQRSTUVWXYZ
Protein3 ABCUVDEFGHIJKLMNOPQRSTVVW
I'm looking for proteins that contain the pattern 'UU', 'UV' or 'VV'. (Edit: this is a simplified example; currently I'm looking at triplets, "[UV][UV][UV]".) I did that using:
y <- x[grep("[UV][UV]", x[,2]),]
So now I know which ones do have the pattern, but I want more. First of all, I want to know how often this pattern is present in the sequence, but I couldn't find out how to do this so far. So that's question number 1.
Question number 2: I want to extract the pattern + part of the sequence in front. So far I've used:
pattern <- "[A-Z]{5}[UV][UV]"
locs <- regexpr(pattern, y[,2])
z <- substr(y[,2], locs, locs+attr(locs,"match.length")-1)
This works, but only for the first occurrence of the pattern; it doesn't return every place where the pattern occurs in a sequence.
What I would like to end up with is something containing this information:
Protein name,
number of patterns found in the sequence,
pattern + part of the desired sequence in front
In my example the results will be something like this:
Protein1
0
Protein2
2
GHIJKUV
PQRSTUV
Protein3
2
ABCUV #don't know about this one, since the sequence in front is shorter than 5. For me it would be best if these would not appear.
PQRSTVV
Edit: In the end I would like to have a data matrix that I can save to a text file and share with others. Preferably, I would end up with something like this:
ProteinName Count Sequence1 Sequence2 Sequence3 SequenceMax
Protein1 0
Protein2 2 GHIJKUV PQRSTUV
Upvotes: 3
Views: 249
Reputation: 121578
I assume your sequences are in a list:
ll <- list('Protein1 ABCDEFGHIJKLMNOPQRSTUWXYZ',
'Protein2 ABCDEFGHIJKUVMNOPQRSTUVWXYZ',
'Protein3 ABCUVDEFGHIJKLMNOPQRSTVVW')
This works:
sapply(ll, function(x)
regmatches(x,gregexpr('[A-Z]{5}UU|[A-Z]{5}UV|[A-Z]{5}VV', x)))
[[1]]
character(0)
[[2]]
[1] "GHIJKUV" "PQRSTUV"
[[3]]
[1] "PQRSTVV"
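The number of matches per protein (question 1) then follows from the length of each element. A minimal sketch, assuming the same `ll` list and pattern as above; `strings`, `m` and `counts` are names introduced here for illustration only:

```r
ll <- list('Protein1 ABCDEFGHIJKLMNOPQRSTUWXYZ',
           'Protein2 ABCDEFGHIJKUVMNOPQRSTUVWXYZ',
           'Protein3 ABCUVDEFGHIJKLMNOPQRSTVVW')

strings <- unlist(ll)

# gregexpr() is vectorised, so regmatches() returns one character
# vector of matches per input string
m <- regmatches(strings,
                gregexpr('[A-Z]{5}UU|[A-Z]{5}UV|[A-Z]{5}VV', strings))

# Number of (non-overlapping) matches per string
counts <- lengths(m)
counts
# [1] 0 2 1
```

Note that Protein3 gets a count of 1 here, because 'ABCUV' has fewer than five letters in front of the pattern and is therefore not matched.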
Edit: to match a run of any length made of any combination of U and V:
pattern <- '[A-Z]{5}(U|V)(V|U)+' ## five letters, then a U or V,
## followed by at least one more U or V
For example, I modified your data to include longer patterns:
ll <- list('Protein1 ABCDEFGHIJKLMNOPQRSTUVWXYZ',
'Protein2 ABCDEFGHIJKUVMNOPQRSTUUVWXYZ',
'Protein3 ABCUVDEFGHIJUVVKLMNOPQRSTVUUUW')
sapply(ll, function(x) regmatches(x,gregexpr(pattern, x)))
[[1]]
[1] "PQRSTUV"
[[2]]
[1] "GHIJKUV" "PQRSTUUV"
[[3]]
[1] "FGHIJUVV" "PQRSTVUUU"
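To get from these match lists to the table asked for in the question, one option is to pad each result to the same width and bind the rows together. This is only a sketch under some assumptions: each entry is "name, one space, sequence"; the pattern is the one from the question; and `parts`, `prot`, `seqs`, `hits`, `padded`, `result` and the output file name are all made up for illustration.

```r
ll <- list('Protein1 ABCDEFGHIJKLMNOPQRSTUWXYZ',
           'Protein2 ABCDEFGHIJKUVMNOPQRSTUVWXYZ',
           'Protein3 ABCUVDEFGHIJKLMNOPQRSTVVW')

# Split every entry into name and sequence (assumes a single space between them)
parts <- strsplit(unlist(ll), " ", fixed = TRUE)
prot  <- sapply(parts, `[`, 1)
seqs  <- sapply(parts, `[`, 2)

# All matches per sequence, and how many there are
pattern <- "[A-Z]{5}[UV][UV]"
hits <- regmatches(seqs, gregexpr(pattern, seqs))
n    <- lengths(hits)

# Pad each match vector with "" so that all rows have the same width
width  <- max(n, 1)
padded <- do.call(rbind,
                  lapply(hits, function(h) c(h, rep("", width - length(h)))))
colnames(padded) <- paste0("Sequence", seq_len(width))

result <- data.frame(ProteinName = prot, Count = n, padded,
                     stringsAsFactors = FALSE)

# Hypothetical file name; tab-separated so it opens cleanly elsewhere
out_file <- file.path(tempdir(), "protein_matches.txt")
write.table(result, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

With the example data this gives one row per protein, with empty strings in the sequence columns where a protein has fewer matches than the maximum.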
Upvotes: 2
Reputation: 263352
For numbers of matches:
> sapply( strsplit(dat[[2]], "UU|UV"), length) -1
[1] 0 2 1
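Two caveats with the `strsplit` count: the pattern above leaves out 'VV', and `strsplit` drops a trailing empty string, so a match at the very end of a sequence is not counted. A sketch of an alternative that counts match positions directly with `gregexpr`; the `dat` below is a stand-in built from the question's example, and `count_matches` is a hypothetical helper name:

```r
# Stand-in for the data: names in column 1, sequences in column 2
dat <- data.frame(V1 = c("Protein1", "Protein2", "Protein3"),
                  V2 = c("ABCDEFGHIJKLMNOPQRSTUWXYZ",
                         "ABCDEFGHIJKUVMNOPQRSTUVWXYZ",
                         "ABCUVDEFGHIJKLMNOPQRSTVVW"),
                  stringsAsFactors = FALSE)

count_matches <- function(x, pattern) {
  # gregexpr() returns the start positions of all non-overlapping matches,
  # or a single -1 when a string has no match at all
  vapply(gregexpr(pattern, x), function(pos) sum(pos > 0), integer(1))
}

count_matches(dat[[2]], "[UV][UV]")
# [1] 0 2 2

# A match at the end of the string is still counted
count_matches("ABCUV", "[UV][UV]")
# [1] 1
```

Matches are counted without overlap, so a run such as 'UUU' counts once for the two-letter pattern.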
To isolate the sequences, check which results differ in length from the input (a result that is identical to its input means nothing matched):
> sub("(.+)(.{5}UU|.{5}UV)(.+)", "\\2", dat[[2]])
[1] "ABCDEFGHIJKLMNOPQRSTUWXYZ" "PQRSTUV" "ABCUVDEFGHIJKLMNOPQRSTVVW"
To bind them together:
> apply(dat, 1, function(x) list(
+     count = sapply(strsplit(x[2], "UU|UV"), length) - 1,
+     matches = {
+         mat <- gsub("(.+)(.{5}UU|.{5}UV)(.+)", "\\2", x[2])
+         if (nchar(mat) != nchar(x[2])) mat else ""
+     }))
[[1]]
[[1]]$count
V2
0
[[1]]$matches
[1] ""
[[2]]
[[2]]$count
V2
2
[[2]]$matches
V2
"PQRSTUV"
[[3]]
[[3]]$count
V2
1
[[3]]$matches
[1] ""
Upvotes: 3