user3470313
user3470313

Reputation: 323

Looking for the amino acids motifs within protein sequence

I have simple search engine consisted of dictionary where UniProt codes and sequences are included for several entries.

Eventually I'd like to find some motifs in all these sequences and print its location (only start amino acid ) in each sequence.

For simple motifs I've done such task using below code

#Simple definition of the motif 
motif='AA'

for u, seq in dict.iteritems():
    for i in range(len(seq)):
        if seq[i:].startswith(motif):
            print "%s has been found in %d position of %s"%(motif, i+1, u)
            continue

where my dict is something like

>>> dict
{'P07204_TRBM_HUMAN': 'MLGVLVLGALALAGLGFPAPAEPQPGGSQCVEHDCFALYPGPATFLNASQICDGLRGHLMTVRSSVAADVISLLLNGDGGVGRRRLWIGLQLPPGCGDPKRLGPLRGFQWVTGDNNTSYSRWARLDLNGAPLCGPLCVAVSAAEATVPSEPIWEEQQCEVKADGFLCEFHFPATCRPLAVEPGAAAAAVSITYGTPFAARGADFQALPVGSSAAVAPLGLQLMCTAPPGAVQGHWAREAPGAWDCSVENGGCEHACNAIPGAPRCQCPAGAALQADGRSCTASATQSCNDLCEHFCVPNPDQPGSYSCMCETGYRLAADQHRCEDVDDCILEPSPCPQRCVNTQGGFECHCYPNYDLVDGECVEPVDPCFRANCEYQCQPLNQTSYLCVCAEGFAPIPHEPHRCQMFCNQTACPADCDPNTQASCECPEGYILDDGFICTDIDECENGGFCSGVCHNLPGTFECICGPDSALARHIGTDCDSGKVDGGDSGSGEPPPSPTPGSTLTPPAVGLVHSGLLIGISIASLCLVVALLALLCHLRKKQGAARAKMEYKCAAPSKEVVLQHVRTERTPQRL', 'B5ZC00': 'MKNKFKTQEELVNHLKTVGFVFANSEIYNGLANAWDYGPLGVLLKNNLKNLWWKEFVTKQKDVVGLDSAIILNPLVWKASGHLDNFSDPLIDCKNCKARYRADKLIESFDENIHIAENSSNEEFAKVLNDYEISCPTCKQFNWTEIRHFNLMFKTYQGVIEDAKNVVYLRPETAQGIFVNFKNVQRSMRLHLPFGIAQIGKSFRNEITPGNFIFRTREFEQMEIEFFLKEESAYDIFDKYLNQIENWLVSACGLSLNNLRKHEHPKEELSHYSKKTIDFEYNFLHGFSELYGIAYRTNYDLSVHMNLSKKDLTYFDEQTKEKYVPHVIEPSVGVERLLYAILTEATFIEKLENDDERILMDLKYDLAPYKIAVMPLVNKLKDKAEEIYGKILDLNISATFDNSGSIGKRYRRQDAIGTIYCLTIDFDSLDDQQDPSFTIRERNSMAQKRIKLSELPLYLNQKAHEDFQRQCQK', 'A2Z669': 'MRASRPVVHPVEAPPPAALAVAAAAVAVEAGVGAGGGAAAHGGENAQPRGVRMKDPPGAPGTPGGLGLRLVQAFFAAAALAVMASTDDFPSVSAFCYLVAAAILQCLWSLSLAVVDIYALLVKRSLRNPQAVCIFTIGDGITGTLTLGAACASAGITVLIGNDLNICANNHCASFETATAMAFISWFALAPSCVLNFWSMASR', 'P20840_SAG1_YEAST': 'MFTFLKIILWLFSLALASAININDITFSNLEITPLTANKQPDQGWTATFDFSIADASSIREGDEFTLSMPHVYRIKLLNSSQTATISLADGTEAFKCYVSQQAAYLYENTTFTCTAQNDLSSYNTIDGSITFSLNFSDGGSSYEYELENAKFFKSGPMLVKLGNQMSDVVNFDPAAFTENVFHSGRSTGYGSFESYHLGMYCPNGYFLGGTEKIDYDSSNNNVDLDCSSVQVYSSNDFNDWWFPQSYNDTNADVTCFGSNLWITLDEKLYDGEMLWVNALQSLPANVNTIDHALEFQYTCLDTIANTTYATQFSTTREFIVYQGRNLGTASAKSSFISTTTTDLTSINTSAYSTGSISTVETGNRTTSEVISHVVTTSTKLSPTATTSLTIAQTSIYSTDSNITVGTDIHTTSEVISDVETISRETASTVVAAPTSTTGWTGAMNTYISQFTSSSFATINSTPIISSSAVFETSDASIVNVHTENITNTAAVPSEEPTFVNATRNSLNSFCSSKQPSSPSSYTSSPLVSSLSVSKTLLSTSFTPSVPTSNTYIKTKNTGYFEHTALTTSSVGLNSFSETAVSSQGTKIDTFLVSSLIAYPSSASGSQLSGIQQNFTSTSLMISTYEGKASIFFSAELGSIIFLLLSYLLF'}

This print all AA motif positions along all three sequence.

Now I'd like to find complex motifs along these sequences using RE.

# search complex motifs using regular expressions
for u, seq in dict.iteritems():
        m=re.search(r"N[^P](S|T)[^P]", seq[:])
        if re.search(r"N[^P](S|T)[^P]", seq[:]):
            print "%s has been found at the %s position in %s"%(m.group(), str(m.start()+1), u)
            continue

Using this code I can detect motif only one time along for sequence. How should I define addition FOR Loop more accuracy to obtain results as in the first case assuming that each motif can be several times in each sequence ?

Upvotes: 1

Views: 2395

Answers (3)

user3470313
user3470313

Reputation: 323

Thanks for suggestion!

Unfortunately all examples wiyh WHILE loops have produced infinitive loops with wrong results.

I've been solved this issue ussing re.match method and my first algorithm. How I can increase efficacy of such looping

for u, seq in dict.iteritems():
    for i in range(len(seq)):
        if re.match(motif, seq[i:]):
            print "%s has been found in %d position of %s"%(motif, i+1, u)          
            found[u]=i+1
            continue

also I have problem with the found dictionary which defined in this loop and should append values (positions of found motif for each Uniprot code (keys). Below you can see that After looping only last position for each key have been appened althought motifs have been found in several positions

{'P07204_TRBM_HUMAN': 409, 'B5ZC00': 395, 'P20840_SAG1_YEAST': 614}

Also how its possible to present motif=re.compile(r"N^P[^P]") in explicit form. Below you can see some wrong in results where in first place motifs should be defined

<_sre.SRE_Pattern object at 0x7f4ee5b11b70> has been found in 364 position of P20840_SAG1_YEAST
<_sre.SRE_Pattern object at 0x7f4ee5b11b70> has been found in 402 position of P20840_SAG1_YEAST
<_sre.SRE_Pattern object at 0x7f4ee5b11b70> has been found in 485 position of P20840_SAG1_YEAST
<_sre.SRE_Pattern object at 0x7f4ee5b11b70> has been found in 501 position of P20840_SAG1_YEAST
<_sre.SRE_Pattern object at 0x7f4ee5b11b70> has been found in 614 position of P20840_SAG1_YEAST

Many thanks for help

Upvotes: 0

loopbackbee
loopbackbee

Reputation: 23322

If you want to find all the occurrences, you simply need to use findall instead of search. It returns a list of results instead of a single result.

Also, you're doing the simple motif search in a way that is much slower than it needs to be. Instead of partitioning the string multiple times (seq[i:]) and using startswith on the partition, consider using string.index on the whole string multiple times:

motif='AA'

for u, seq in dict.iteritems():
    i=-1 #start search at the beginning of the sequence
    while True:
        try:
            i= seq.index(motif, i+1) #get the index of the next occurrence
            print "%s has been found in %d position of %s"%(motif, i+1, u)
        except ValueError:
            break #no more motifs found

Upvotes: 1

Math
Math

Reputation: 2436

You could repeat your research on subsequences :

for u, seq in dict.iteritems():
    start = 0;
    m=re.search(r"N[^P](S|T)[^P]", seq[start:])
    while (m) :
        print "%s has been found at the %s position in %s"%(m.group(), str(m.start()+1), u)
        start = m.start()
        m=re.search(r"N[^P](S|T)[^P]", seq[start:])

It won't work if your motif overlaps with itself (ie if you look for AEA in AEAEA you would only get (AEA)EA but not AE(AEA)), in which case you need a more precise research.

Upvotes: 0

Related Questions