Reputation: 617
I am trying to find an amino acid pattern (B-C or M-D, where '-' can be any letter other than 'P') in a protein sequence, let's say 'VATLDSCBACSKVNDNVKNKVKVKNVKMLDHHHV'. The protein sequence is in a FASTA file.
I have tried a lot but couldn't find a solution. The following code is one of my attempts:
import Bio
from Bio import SeqIO
seqs= SeqIO.parse(X, 'fasta') ### to read the sequences from fasta file
for aa in seqs:
    x=aa.seq ## gives the sequences as a string (.seq is a build in function of Biopython)
    for val, i in enumerate(x):
        if i=='B':
            if (x[val+2])=='C':
                if x[val+1]!='P':
                    pattern=((x[val]:x[val+2])) ## trying to print full sequence B-C
But unfortunately none of them work. It would be great if someone can help me out with this problem.
Upvotes: 1
Views: 1472
Reputation: 1224
In Python you can use the regular expression module (re):
import re  # import the regular expression module
import Bio
from Bio import SeqIO
seqs = SeqIO.parse(X, 'fasta')  # X is the path to the FASTA file
for sequence in seqs:
    line = str(sequence.seq)  # convert the Seq object to a plain string for re
    RE = r'B[A-OQ-Z]C|M[A-OQ-Z]D'
    # [A-OQ-Z] : match any letter from A to O or from Q to Z (i.e. excluding P)
    # | : the OR operator = either the left or the right part should match
    # The r prefix makes this a raw string, so backslashes are not treated as escapes
    results = re.findall(RE, line)
    # The function findall returns a list of all non-overlapping matches.
    # To iterate over each result:
    for res in results:
        print(res)
You can then modify the regular expression to match any other rule you need.
More information about the findall function is available here: re.findall(...)
The following website can help you build a regex : https://regex101.com/
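Because findall only reports non-overlapping matches, two motifs that share residues would be reported as one. The sketch below, using a made-up four-letter string rather than the sequence from the question, shows one possible way around this: wrapping the pattern in a lookahead so that each match is zero-width. Whether overlapping hits matter depends on your use case.

```python
import re

# Hypothetical test string chosen so that the two motifs overlap:
# "BMC" covers indices 0-2 and "MCD" covers indices 1-3.
seq = "BMCD"

# Plain alternation: the scan resumes after the end of "BMC", so "MCD" is missed.
plain = re.findall(r"B[A-OQ-Z]C|M[A-OQ-Z]D", seq)

# Lookahead version: each match consumes no characters, so the scan
# advances one position at a time and the captured group holds the motif.
overlapping = re.findall(r"(?=(B[A-OQ-Z]C|M[A-OQ-Z]D))", seq)

print(plain)        # ['BMC']
print(overlapping)  # ['BMC', 'MCD']
```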
Upvotes: 2
Reputation: 398
Use a regular expression with a negated character class: [^P] matches any single character except P.
import re
string = 'VATLDSCBACSKVNDNVKNKVKVKNVKMLDHHHV'
re.findall(r"B[^P]C|M[^P]D", string)
Output:
['BAC', 'MLD']
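One thing to keep in mind is that [^P] accepts any character other than P, not only uppercase letters. The short sketch below uses a made-up string containing a gap character and a lowercase residue to compare it with the explicit range [A-OQ-Z]. Which behaviour you want depends on how clean your sequences are.

```python
import re

# Hypothetical sequence with a gap character '-', a lowercase residue,
# and one "BPC" that neither pattern should accept.
seq = "B-CMlDBPC"

# [^P] : anything except P, including '-' and lowercase letters
negated = re.findall(r"B[^P]C|M[^P]D", seq)

# [A-OQ-Z] : uppercase letters only, excluding P
letters_only = re.findall(r"B[A-OQ-Z]C|M[A-OQ-Z]D", seq)

print(negated)       # ['B-C', 'MlD']
print(letters_only)  # []
```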
Upvotes: 1
Reputation: 1506
A common solution for pattern matching is the usage of regex.
A possible regex for your problem is B[^P]C|M[^P]D.
The following code was generated by regex101 from the regex I propose and the test string you gave. It finds all matches of the pattern along with their positions in the original string.
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"B[^P]C|M[^P]D"
test_str = "VATLDSCBACSKVNDNVKNKVKVKNVKMLDHHHV"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
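The positions reported by match.start() and match.end() are 0-based, with an exclusive end. Protein coordinates are usually given as 1-based residue numbers, so here is a shorter sketch of the same finditer idea that converts them. The names pattern, hits, first and last are just illustrative.

```python
import re

pattern = re.compile(r"B[^P]C|M[^P]D")
seq = "VATLDSCBACSKVNDNVKNKVKVKNVKMLDHHHV"

# m.start() is 0-based, so add 1 for the first residue number;
# m.end() is exclusive and 0-based, which already equals the
# 1-based number of the last residue in the match.
hits = [(m.group(), m.start() + 1, m.end()) for m in pattern.finditer(seq)]

for motif, first, last in hits:
    print(f"{motif}: residues {first}-{last}")
# BAC: residues 8-10
# MLD: residues 28-30
```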
Upvotes: 2
Reputation: 405
>>> x = 'VATLDSCBACSKVNDNVKNKVKVKNVKMLDHHHV'
>>> import re
>>> m = re.search('B(.+?)C', x)
>>> m
<_sre.SRE_Match object at 0x10262aeb0>
>>> m = re.search('B(.+?)C', x).group(0)
>>> m
'BAC'
>>> m = re.search('M(.+?)D', x).group(0)
>>> m
'MLD'
>>> re.search(r"(?<=M).*?(?=D)", x).group(0)
'L'
>>> re.search(r"(?<=B).*?(?=C)", x).group(0)
'A'
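Note that .+? and .*? in the session above match any number of characters of any kind, so they happen to give the right result for this particular sequence but do not enforce "exactly one residue, and not P". A small sketch with a made-up sequence shows the difference from a [^P] character class:

```python
import re

# Hypothetical sequence: B and C are separated by three residues, one of them P.
seq = "VBAPKCV"

# Lazy wildcard: expands until the first C it can reach, P included.
lazy = re.search(r"B(.+?)C", seq)

# Negated class: exactly one character between B and C, and it must not be P.
strict = re.search(r"B[^P]C", seq)

print(lazy.group(0))  # BAPKC
print(strict)         # None
```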
Upvotes: 3