shivam

Reputation: 617

Find a pattern match in a string in Python

I am trying to find an amino acid pattern (B-C or M-D, where '-' can be any letter other than 'P') in a protein sequence, for example 'VATLDSCBACSKVNDNVKNKVKVKNVKMLDHHHV'. The protein sequence is in a FASTA file.

I have tried a lot but couldn't find a solution. The following code is one of my attempts:

import Bio
from Bio import SeqIO

seqs= SeqIO.parse(X, 'fasta') ### to read the sequences from fasta file
for aa in seqs:
    x=aa.seq ## the sequence of this record (.seq is an attribute of a Biopython SeqRecord and holds a Seq object)
    
    for val, i in enumerate(x):          
        
        if i=='B':    
            if (x[val+2])=='C':
                
                if x[val+1]!='P':
                   pattern=((x[val]:x[val+2])) ## trying to print full sequence B-C
                

Unfortunately, none of my attempts worked. It would be great if someone could help me out with this problem.

Upvotes: 1

Views: 1472

Answers (4)

vinalti

Reputation: 1224

In Python you can use the regular expression module (re):

import re      # import the RE module
import Bio
from Bio import SeqIO

seqs = SeqIO.parse(X, 'fasta')
for sequence in seqs:
    line = str(sequence.seq)  # convert the Seq object to a plain string for re

    RE = r'B[A-OQ-Z]C|M[A-OQ-Z]D'
    # [A-OQ-Z] : matches A to O and Q to Z (any uppercase letter except P)
    # | : the OR operator; either the left or the right part must match
    # The r prefix makes this a raw string, so backslashes reach the regex engine unchanged

    results = re.findall(RE, line)
    # The function findall will return a list of all non-overlapping matches.

    # To iterate over each result :
    for res in results:
        print(res)

You can then modify the regular expression to match any other rule you need.

More information about the findall function is in the Python documentation for re.findall.

The following website can help you build a regex: https://regex101.com/
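One thing to keep in mind is that findall returns non-overlapping matches, so a motif that shares residues with a previous hit is skipped. If overlapping hits matter, a lookahead with a capturing group is one possible approach. The following is only a minimal sketch; the helper name find_overlapping and the sequence "BMCD" are made up for illustration:

```python
import re

# Same motif as above, wrapped in a zero-width lookahead with a capturing
# group. The lookahead consumes nothing, so the scan moves on one character
# at a time and hits that share residues are all reported.
OVERLAPPING = re.compile(r"(?=(B[A-OQ-Z]C|M[A-OQ-Z]D))")
PLAIN = re.compile(r"B[A-OQ-Z]C|M[A-OQ-Z]D")


def find_overlapping(sequence):
    """Return every motif hit, including hits that share residues."""
    return [m.group(1) for m in OVERLAPPING.finditer(sequence)]


# Made-up sequence: BMC (index 0) overlaps MCD (index 1).
example = "BMCD"
plain_hits = PLAIN.findall(example)
overlapping_hits = find_overlapping(example)

print(plain_hits)        # only the first of the two overlapping motifs
print(overlapping_hits)  # both motifs
```

For the sequence in the question the two approaches give the same result, because its two motifs do not overlap.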

Upvotes: 2

Vladimir Vilimaitis

Reputation: 398

Use a regular expression with a negated character class, "[^P]", which matches any single character except P.

import re

string = 'VATLDSCBACSKVNDNVKNKVKVKNVKMLDHHHV'
re.findall(r"B[^P]C|M[^P]D", string)

Output:

['BAC', 'MLD']
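Note that [^P] accepts any character other than P, not only letters. If the sequence may contain gap characters, stop symbols or lowercase letters, an explicit letter range is stricter. A small sketch of the difference, using a made-up input string:

```python
import re

NEGATED = re.compile(r"B[^P]C|M[^P]D")          # any character except P
LETTERS = re.compile(r"B[A-OQ-Z]C|M[A-OQ-Z]D")  # uppercase letters except P

# Made-up input with a gap character, a stop symbol and a lowercase letter.
messy = "B-CM*DBaCBPC"
negated_hits = NEGATED.findall(messy)
letter_hits = LETTERS.findall(messy)

print(negated_hits)  # accepts '-', '*' and 'a' in the middle position
print(letter_hits)   # accepts none of them
```

On a clean uppercase protein sequence such as the one in the question, both patterns return the same matches.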

Upvotes: 1

Raida

Reputation: 1506

A common solution for pattern matching is to use a regex.

A possible regex for your problem is B[^P]C|M[^P]D.

The following code was generated by regex101 with the regex I propose and the test string you gave. It finds all matching patterns along with their positions in the original string.

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"B[^P]C|M[^P]D"

test_str = "VATLDSCBACSKVNDNVKNKVKVKNVKMLDHHHV"

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):
    
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
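The generated code reports 0-based start offsets and exclusive end offsets. For protein work, 1-based inclusive positions per record are often more convenient. A shorter sketch along those lines follows; the function name motif_positions and the record id "seq1" are assumptions for illustration, and with Biopython the pairs could be built as (record.id, str(record.seq)):

```python
import re

MOTIF = re.compile(r"B[^P]C|M[^P]D")


def motif_positions(records):
    """Yield (record_id, start, end, motif) with 1-based, inclusive positions.

    `records` is any iterable of (record_id, sequence_string) pairs.
    """
    for record_id, sequence in records:
        for match in MOTIF.finditer(sequence):
            # match.start() is 0-based; match.end() is exclusive, so it already
            # equals the 1-based index of the last matched residue.
            yield record_id, match.start() + 1, match.end(), match.group()


# Hypothetical record id paired with the sequence from the question.
records = [("seq1", "VATLDSCBACSKVNDNVKNKVKVKNVKMLDHHHV")]
hits = list(motif_positions(records))

for hit in hits:
    print(*hit, sep="\t")
```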

Upvotes: 2

user3327034

Reputation: 405

>>> x = 'VATLDSCBACSKVNDNVKNKVKVKNVKMLDHHHV'
>>> import re
>>> m = re.search('B(.+?)C', x)
>>> m
<_sre.SRE_Match object at 0x10262aeb0>
>>> m = re.search('B(.+?)C', x).group(0)
>>> m
'BAC'
>>> m = re.search('M(.+?)D', x).group(0)
>>> m
'MLD'
>>> re.search(r"(?<=M).*?(?=D)", x).group(0)
'L'
>>> re.search(r"(?<=B).*?(?=C)", x).group(0)
'A'
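The patterns in the session above work for the example sequence, but `.+?` matches one or more of any character, so it does not exclude P and can span several residues. A brief sketch comparing it with a one-residue, non-P pattern; the helper first_hit and the test strings are made up for illustration:

```python
import re

LAZY = re.compile(r"B(.+?)C")   # pattern from the session above
STRICT = re.compile(r"B[^P]C")  # exactly one residue in the middle, never P


def first_hit(pattern, sequence):
    """Return the text of the first match, or None if there is no match."""
    match = pattern.search(sequence)
    return match.group(0) if match else None


# Made-up test strings: a P in the middle, and several residues in the middle.
for seq in ("BAC", "BPC", "BAAAC"):
    print(seq, first_hit(LAZY, seq), first_hit(STRICT, seq))
```

Only "BAC" satisfies the rule stated in the question; the lazy pattern also accepts the other two strings.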

Upvotes: 3
