Finding matching motifs on sequence and their positions

Question

I am trying to find matching motifs in a sequence, along with the position of each motif, and then write the results to a FASTA file. The code below checks whether the motif [L**L*L] is present in the sequence. When I run it, it prints "YES", but I do not know where the motif is positioned.

The ** inside the square bracket is to show that any amino acid there is permitted.

This is the code I used to check whether the motif is present in the sequence, and it worked because it returned "YES".

peptide1= "MKFSNEVVHKSMNITEDCSALTGALLKYSTDKSNMNFETLYRDAAVESPQHEVSNESGSTLKEHDYFGLSEVSSSNSSSGKQPEKCCREELNLNESATTLQLGPPAAVKPSGHADGADAHDEGAGPENPAKRPAHHMQQESLADGRKAAAEMGSFKIQRKNILEEFRAMKAQAHMTKSPKPVHTMQHNMHASFSGAQMAFGGAKNNGVKRVFSEAVGGNHIAASGVGVGVREGNDDVSRCEEMNGTEQLDLKVHLPKGMGMARMAPVSGGQNGSAWRNLSFDNMQGPLNPFFRKSLVSKMPVPDGGDSSANASNDCANRKGMVASPSVQPPPAQNQTVGWPPVKNFNKMNTPAPPASTPARACPSVQRKGASTSSSGNLVKIYMDGVPFGRKVDLKTNDSYDKLYSMLEDMFQQYISGQYCGGRSSSSGESHWVASSRKLNFLEGSEYVLIYEDHEGDSMLVGDVPWELFVNAVKRLRIMKGSEQVNLAPKNADPTKVQVAVG"

import re

if re.search(r"L*L*L", peptide1):
    print("YES")
else:
    print("NO")

The code that I wrote to find the position is below, but when I run it, I get an "invalid syntax" error. Could you please assist? I have no clue whether I am on the right track, as I am still new to the field and to Python.

 for position in range(len(s)):
    if peptide[position:].startswith(r"L*L*L"):
        print(position+1)

I expect the output to report the positions where the motif was found, for example [2, 10]. Those are just random positions I chose, since I don't know where the motif actually is.

Ghoti · Accepted Answer

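You can use re.finditer() to find every match of a regex pattern within a string, together with the position of each match. Note that in regex, "*" is a quantifier meaning "zero or more of the preceding character", not a wildcard, so r"L*L*L" matches any single L; that is why your check printed "YES". Reading each * as a single wildcard residue, your peptide1 example does not contain an "L*L*L" motif, so I used a simple string as a demo.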

import re

simple_demo_string = "ABCLXLYLZLABC" # use a simple string to demonstrate code

The demo string contains two overlapping motifs. Normally, regex matches do not account for overlap, so a plain search reports only the first of the two.

Example 1

simple_regex = "L.L.L" # in regex, periods are match-any wildcards

for x in re.finditer(simple_regex, simple_demo_string):
    print( x.start(), x.end(), x.group() )

# Output: 3 8 LXLYL
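
To make the wildcard point concrete, here is a minimal sketch contrasting "*" and "." on short strings. The variable names and the test string "XXLXX" are illustrative only, not from the question.

```python
import re

demo = "ABCLXLYLZLABC"

# "*" is a quantifier: "L*L*L" means zero or more L, zero or more L, then one L,
# so a single L anywhere in the string is enough for a match.
star_match = re.search(r"L*L*L", "XXLXX")
print(star_match.group())  # L

# "." matches any single character, so "L.L.L" requires the full five-residue motif.
dot_match = re.search(r"L.L.L", demo)
print(dot_match.start(), dot_match.group())  # 3 LXLYL

# The same five-residue pattern finds nothing in a string with only one L.
print(re.search(r"L.L.L", "XXLXX"))  # None
```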

However, if you use a capturing group inside a lookahead, you'll be able to get everything even if there's overlap.

Example 2

lookahead_regex = "(?=(L.L.L))"

for x in re.finditer(lookahead_regex, simple_demo_string):
    # note - x.end() becomes same as x.start() due to lookahead 
    # but can be corrected by simply adding length of match
    print( x.start(), x.start()+len(x.group(1)), x.group(1) )

# Output: 3 8 LXLYL
#         5 10 LYLZL
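
Since the question asks for positions and FASTA output, the lookahead approach can be wrapped up roughly as follows. This is only a sketch: the function names, the header layout (name_start-end), and the choice of 1-based inclusive positions are assumptions, not an established convention from the question.

```python
import re

def find_motif_positions(sequence, motif):
    """Return (start, end, match) tuples, 1-based inclusive, overlaps included."""
    lookahead = re.compile(f"(?=({motif}))")
    hits = []
    for m in lookahead.finditer(sequence):
        start = m.start() + 1                # 0-based index -> 1-based position
        end = m.start() + len(m.group(1))    # 1-based inclusive end
        hits.append((start, end, m.group(1)))
    return hits

def hits_to_fasta(hits, seq_name):
    """Format each hit as a FASTA record with its position in the header."""
    # Header layout is a hypothetical choice: >name_start-end
    return "".join(f">{seq_name}_{start}-{end}\n{match}\n"
                   for start, end, match in hits)

hits = find_motif_positions("ABCLXLYLZLABC", "L.L.L")
print(hits)  # [(4, 8, 'LXLYL'), (6, 10, 'LYLZL')]

fasta_text = hits_to_fasta(hits, seq_name="demo")
print(fasta_text, end="")
```

To save the records, write fasta_text to a file opened in "w" mode. For the real data, pass peptide1 in place of the demo string.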
