Reputation: 91
I am currently trying to use the findall function in re to capture amino acid sequences for proteins. I am having trouble getting the syntax with the regular expression to work. Here is a simplified part of the code I am struggling with:
import re
line=">sp|A0A385XJ53|INSA9_ECOLI Insertion element IS1 9 protein InsA OS=Escherichia coli (strain K12) OX=83333 GN=insA9 PE=3 SV=1 MASVSISCPSCSATDGVVRNGKSTAGHQRYLCSHCRKTWQLQFTYTASQPGTHQKIIDMA"
result=re.findall(r'SV=(\d{1})\s{1}[A-Z]*', line)
for item in result:
    print(item)
I would like it to return the letter sequence following SV=1, but it returns "1" and not "MASVSISC...", and I'm confused as to why. I read my pattern as "SV= followed by a single digit, a single space, and then a sequence of capital letters of unspecified length." How can I get it to return the amino acid sequence?
I've tried a couple of different things. I figured maybe I was confusing the placement of "*" or using it in place of "+" by accident. However, I am still getting "1" for the following attempts:
result=re.findall(r'SV=(\d{1})\s{1}[A-Z*]', line)
result=re.findall(r'SV=(\d{1})\s{1}[A-Z]+', line)
result=re.findall(r'SV=(\d{1})\s{1}[A-Z+]', line)
Upvotes: 0
Views: 85
Reputation: 6613
I think you might be able to parse the amino acids without using a regex. Perhaps the following could be used:
rspace = line.rindex(' ')  # index of the last space in the line
seq = line[rspace+1:]      # everything after that space is the sequence
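If you would rather keep the regex: when a pattern contains a capture group, re.findall returns only what the group matched, which is why you get "1" back. Note also that [A-Z*] and [A-Z+] put the quantifier inside the character class, so they match a single character that is either a capital letter or a literal * or +. Here is a minimal sketch that moves the parentheses around the sequence instead of the digit. It assumes the sequence is a run of capital letters directly after "SV=" plus one digit and one whitespace character, as in your sample line; the name sequences is just mine for illustration.

```python
import re

line = (">sp|A0A385XJ53|INSA9_ECOLI Insertion element IS1 9 protein InsA "
        "OS=Escherichia coli (strain K12) OX=83333 GN=insA9 PE=3 SV=1 "
        "MASVSISCPSCSATDGVVRNGKSTAGHQRYLCSHCRKTWQLQFTYTASQPGTHQKIIDMA")

# With exactly one capture group, findall returns the text of that group,
# so the group wraps the letters we want rather than the version digit.
sequences = re.findall(r'SV=\d\s([A-Z]+)', line)

for seq in sequences:
    print(seq)
```

If the SV field can have more than one digit in your real data, \d+ in place of \d would cover that, but I haven't checked that against your files.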
Upvotes: 1