How do I capture a series of letters using the findall function with re in Python?

Question

I am currently trying to use the findall function in re to capture amino acid sequences for proteins. I am having trouble getting the syntax with the regular expression to work. Here is a simplified part of the code I am struggling with:

import re
line=">sp|A0A385XJ53|INSA9_ECOLI Insertion element IS1 9 protein InsA OS=Escherichia coli (strain K12) OX=83333 GN=insA9 PE=3 SV=1 MASVSISCPSCSATDGVVRNGKSTAGHQRYLCSHCRKTWQLQFTYTASQPGTHQKIIDMA"
result=re.findall(r'SV=(\d{1})\s{1}[A-Z]*', line)
for item in result:
    print(item)

I would like it to return the letter sequence following SV=1, but it returns "1" instead of "MASVSISC...", and I'm confused as to why. My reading of the pattern is: "SV=" followed by a single digit, a single space, and then a sequence of capital letters of unspecified length. How can I get it to return the amino acid sequence?

I've tried a couple of different things. I figured maybe I was confusing the placement of "*" or using it in place of "+" by accident. However, I am still getting "1" for the following attempts:

result=re.findall(r'SV=(\d{1})\s{1}[A-Z*]', line)

result=re.findall(r'SV=(\d{1})\s{1}[A-Z]+', line)

result=re.findall(r'SV=(\d{1})\s{1}[A-Z+]', line)

Answers (1)
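
A likely explanation, sketched under the assumption that the input looks like the line in the question: when a pattern contains exactly one capturing group, re.findall returns only what that group matched, not the whole match. In the original pattern the only group is (\d{1}), so the result is "1". Moving the parentheses around the letter part should return the sequence instead. The variable names below (original, sequences) are only illustrative.

```python
import re

line = (">sp|A0A385XJ53|INSA9_ECOLI Insertion element IS1 9 protein InsA "
        "OS=Escherichia coli (strain K12) OX=83333 GN=insA9 PE=3 SV=1 "
        "MASVSISCPSCSATDGVVRNGKSTAGHQRYLCSHCRKTWQLQFTYTASQPGTHQKIIDMA")

# Pattern from the question: the only capturing group wraps the digit,
# so findall returns the digit rather than the letters.
original = re.findall(r'SV=(\d{1})\s{1}[A-Z]*', line)

# Same idea, but the parentheses now wrap the part we want back.
# \d and \s already mean "exactly one", so {1} is dropped.
sequences = re.findall(r'SV=\d\s([A-Z]+)', line)

for seq in sequences:
    print(seq)
```

Two side notes on the other attempts: [A-Z*] and [A-Z+] are character classes that match a single character (an uppercase letter, or a literal "*" or "+"), so the quantifier has to sit outside the brackets. If both the digit and the sequence are needed, keeping two groups, as in r'SV=(\d)\s([A-Z]+)', makes findall return a list of tuples.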
