Reputation: 117
I am trying to find matching motifs in a sequence, along with the position of each motif, and then write the result to a FASTA file. The code below checks whether the motif [L**L*L] is present in the sequence. When I run it, it returns "YES", but I do not know where the motif is positioned.
The ** inside the square brackets indicates that any amino acid is permitted there.
This is the code I used to check whether the motif is present in the sequence. It appeared to work because it returned "YES".
peptide1= "MKFSNEVVHKSMNITEDCSALTGALLKYSTDKSNMNFETLYRDAAVESPQHEVSNESGSTLKEHDYFGLSEVSSSNSSSGKQPEKCCREELNLNESATTLQLGPPAAVKPSGHADGADAHDEGAGPENPAKRPAHHMQQESLADGRKAAAEMGSFKIQRKNILEEFRAMKAQAHMTKSPKPVHTMQHNMHASFSGAQMAFGGAKNNGVKRVFSEAVGGNHIAASGVGVGVREGNDDVSRCEEMNGTEQLDLKVHLPKGMGMARMAPVSGGQNGSAWRNLSFDNMQGPLNPFFRKSLVSKMPVPDGGDSSANASNDCANRKGMVASPSVQPPPAQNQTVGWPPVKNFNKMNTPAPPASTPARACPSVQRKGASTSSSGNLVKIYMDGVPFGRKVDLKTNDSYDKLYSMLEDMFQQYISGQYCGGRSSSSGESHWVASSRKLNFLEGSEYVLIYEDHEGDSMLVGDVPWELFVNAVKRLRIMKGSEQVNLAPKNADPTKVQVAVG"
import re

if re.search(r"L*L*L", peptide1):
    print("YES")
else:
    print("NO")
The code I wrote to find the position is below, but when I run it, it reports invalid syntax. Could you please assist? I have no idea whether I am on the right track, as I am still new to the field and to Python.
for position in range(len(s)):
    if peptide[position:].startswith(r"L*L*L"):
        print(position+1)
I expect the output to report the positions where the motif was found, for example that the motif is at positions [2, 10] or whatever the real numbers are. These are just random positions that I chose, since I don't know where the motif is actually located.
Upvotes: 0
Views: 858
Reputation: 759
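You can use re.finditer() to find every regex match in a string, together with its position. Note that in a regex, * is a quantifier ("zero or more of the preceding character"), not a wildcard, so r"L*L*L" matches any string that contains at least one L; that is why your check printed "YES". The any-character wildcard is a period. Your peptide1 example does not actually contain an "L*L*L" motif (L, any residue, L, any residue, L), so I used a simple string for the demo.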
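To illustrate why the original check printed "YES", here is a minimal sketch comparing the two patterns on short made-up strings (the test strings and variable names are hypothetical and used only for illustration):

```python
import re

# "*" means "zero or more of the preceding character", so "L*L*L"
# only requires a single "L" somewhere in the string.
star_hit = re.search(r"L*L*L", "AAALAAA")
print(star_hit.group())        # L

# "." matches any single character, so "L.L.L" requires three L's
# separated by exactly one character each.
dot_miss = re.search(r"L.L.L", "AAALAAA")
print(dot_miss)                # None

dot_hit = re.search(r"L.L.L", "AALXLYLAA")
print(dot_hit.start(), dot_hit.group())   # 2 LXLYL
```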
import re

simple_demo_string = "ABCLXLYLZLABC" # use a simple string to demonstrate code
The demo string contains two overlapping motifs. By default, re.finditer() returns only non-overlapping matches, so the second motif is skipped.
Example 1
simple_regex = "L.L.L" # in regex, periods are match-any wildcards
for x in re.finditer(simple_regex, simple_demo_string):
    print( x.start(), x.end(), x.group() )
# Output: 3 8 LXLYL
However, if you use a capturing group inside a lookahead, you'll be able to get everything even if there's overlap.
Example 2
lookahead_regex = "(?=(L.L.L))"
for x in re.finditer(lookahead_regex, simple_demo_string):
    # note - x.end() becomes same as x.start() due to lookahead
    # but can be corrected by simply adding length of match
    print( x.start(), x.start()+len(x.group(1)), x.group(1) )
# Output: 3 8 LXLYL
#         5 10 LYLZL
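Since the question asks for positions counted from 1 and for FASTA output, the lookahead approach could be wrapped up roughly as follows. This is only a sketch: the function names find_motif_positions and hits_to_fasta, and the header format ">name_motif_start-end", are made up for illustration and are not part of any library or FASTA convention.

```python
import re

def find_motif_positions(sequence, motif="L.L.L"):
    """Return (start, end, match) tuples, 1-based and inclusive, overlaps included."""
    # Wrap the motif in a capturing group inside a lookahead so that
    # overlapping occurrences are all reported.
    pattern = re.compile(f"(?=({motif}))")
    hits = []
    for m in pattern.finditer(sequence):
        matched = m.group(1)
        start = m.start() + 1              # convert 0-based index to 1-based
        end = m.start() + len(matched)     # inclusive 1-based end
        hits.append((start, end, matched))
    return hits

def hits_to_fasta(hits, name="peptide1"):
    """Format each hit as a FASTA-style record (hypothetical header layout)."""
    lines = []
    for start, end, matched in hits:
        lines.append(f">{name}_motif_{start}-{end}")
        lines.append(matched)
    return "\n".join(lines)

hits = find_motif_positions("ABCLXLYLZLABC")
print(hits)   # [(4, 8, 'LXLYL'), (6, 10, 'LYLZL')]
print(hits_to_fasta(hits, name="demo"))
```

The returned text can then be written to a file with an ordinary open(..., "w") call. If no motif is present, the list is simply empty.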
Upvotes: 1