Reputation: 378
sp|P46531|NOTC1_HUMAN Neurogenic locus notch homolog protein 1 OS=Homo sapiens GN=NOTCH1 PE=1 SV=4
MPPLLAPLLCLALLP
I have a FASTA file and I would like to search it for the beginning of the amino acid sequence. My attempt so far looks something like this:
aminoacids = ['A','C','D','E','F','G','H','I','K','L','M','N','P','Q','R','S','T','V','W','Y']
for filename in file_list:
    with open(filename, 'r') as fh:
        while True:
            char = fh.read(1)
            if not char:
                break  # end of file, otherwise the loop never stops
            if char.upper() in aminoacids:
                # look for the 4 characters directly after it
                pass
The idea is that if a character is in the amino acid list and the four characters after it are also in the list, a string is built starting at that character and running to the end of the file. For example, if M is found, I would like to check the next four characters (PPLL). If those are all amino acids, I would like to create a string starting with M and continuing to the end of the file.
Upvotes: 0
Views: 79
Reputation: 78590
You can read in the file as a single string, and then search for a regular expression:
import re

regex = re.compile("[%s]{5}.*" % "".join(aminoacids))
with open(filename, 'r') as fh:
    s = fh.read()
    aa_sequence = regex.findall(s)
    if aa_sequence:
        # an amino acid sequence was found
        print(aa_sequence[0])
This works because the regular expression that is constructed is:
[ACDEFGHIKLMNPQRSTVWY]{5}.*
which means "5 of these characters, followed by anything up to the end of the line" (by default, the dot does not match a newline).
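As a quick check, here is a minimal sketch that applies the same pattern to the example record from the question. The names record and first_hit are illustrative only, and the record is assumed to be a header line followed by the sequence on its own line:

```python
import re

aminoacids = "ACDEFGHIKLMNPQRSTVWY"
regex = re.compile("[%s]{5}.*" % aminoacids)

# Example record from the question (assumed layout: header line, then sequence line)
record = (
    "sp|P46531|NOTC1_HUMAN Neurogenic locus notch homolog protein 1 "
    "OS=Homo sapiens GN=NOTCH1 PE=1 SV=4\n"
    "MPPLLAPLLCLALLP\n"
)

# search() returns the first place where 5 amino acid letters occur in a row
match = regex.search(record)
first_hit = match.group(0) if match else None
print(first_hit)
```

The header contains no run of five uppercase amino acid letters, so the first hit starts at the M of the sequence line. The pattern is case-sensitive; a file with lowercase residues would need re.IGNORECASE or an upper-cased copy of the text.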
Note that if your amino acid string may span multiple lines, you'll need to remove the newlines first, with:
s = fh.read().replace('\n', '')
# or
s = "".join(line.rstrip('\n') for line in fh)
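Putting the pieces together, a rough sketch of the "from the first five amino acids to the end of the file" behaviour could look like this. The function name find_sequence and the constant names are made up for illustration, and it assumes the whole file fits in memory as one string:

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Only the start needs a pattern; the rest is taken by slicing
SEQ_START = re.compile("[%s]{5}" % AMINO_ACIDS)


def find_sequence(text):
    """Return everything from the first run of 5 amino acid letters to the
    end of the text, with line breaks removed, or None if there is no run."""
    flat = text.replace("\r", "").replace("\n", "")
    match = SEQ_START.search(flat)
    return flat[match.start():] if match else None


# Sequence deliberately split over two lines to show the newline handling
example = "sp|P46531|NOTC1_HUMAN example header\nMPPLLAPLL\nCLALLP\n"
print(find_sequence(example))
```

To use it on a file, pass the result of fh.read() as text. One caveat: removing the newlines also glues the end of the header onto the start of the sequence, so a header that ends in uppercase amino acid letters could shift where the match begins.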
Upvotes: 2