Searching a string for a substring of characters in a list

Question

sp|P46531|NOTC1_HUMAN Neurogenic locus notch homolog protein 1 OS=Homo sapiens GN=NOTCH1 PE=1 SV=4 MPPLLAPLLCLALLP

I have a FASTA file and I would like to search it for the beginning of the amino acid sequence. My attempt so far looks something like this:

aminoacids = ['A','C','D','E','F','G','H','I','K','L','M','N','P','Q','R','S','T','V','W','Y']
for filename in file_list:
    with open(filename,'r') as fh:
        while True:
            char = fh.read(1)
            if char.upper() in aminoacids:
                #look for the 4 characters directly after it

If a character is in the amino acid list and the four characters after it are also in the list, I want to build a string that starts at that character and runs to the end of the file. For example, while iterating through the file, if M is found, I would check the next four characters (PPLL). If those are all amino acids, I would create a string starting with M and continuing to the end of the file.

David Robinson · Accepted Answer

You can read in the file as a single string, and then search for a regular expression:

import re

regex = re.compile("[%s]{5}.*" % "".join(aminoacids))

with open(filename, 'r') as fh:
    s = fh.read()
    aa_sequence = regex.findall(s)
    if len(aa_sequence) > 0:
        # an amino acid sequence was found
        print(aa_sequence[0])

This works because the regular expression that is constructed is:

[ACDEFGHIKLMNPQRSTVWY]{5}.*

which means "5 of these characters, followed by anything."
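As a quick check of that pattern, here is a minimal sketch that applies it to the sample line from the question. The variable names `line`, `match`, and `sequence` are illustrative only, and it uses `search` rather than `findall` because only the first match is needed.

```python
import re

# The 20 one-letter amino acid codes from the question, as a single string.
aminoacids = "ACDEFGHIKLMNPQRSTVWY"
regex = re.compile("[%s]{5}.*" % aminoacids)

# Sample line taken from the question (header text followed by the sequence start).
line = ("sp|P46531|NOTC1_HUMAN Neurogenic locus notch homolog protein 1 "
        "OS=Homo sapiens GN=NOTCH1 PE=1 SV=4 MPPLLAPLLCLALLP")

# search() returns the first match, or None if no run of 5 amino acids exists.
match = regex.search(line)
sequence = match.group(0) if match else None
print(sequence)
```

The header contains uppercase runs such as NOTCH1 and HUMAN, but O and U are not in the amino acid list, so no run of five is found until the sequence itself.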

Note that "." does not match a newline by default, so if your amino acid sequence may span multiple lines, you'll need to remove the newlines first, with:

s = fh.read().replace('\n', '')
# or
s = "".join(line.rstrip('\n') for line in fh)
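Putting the newline handling and the search together, one possible sketch is shown below. The helper name `find_sequence` and the in-memory sample text are assumptions made for illustration; with a real file you would pass the handle returned by `open(filename)` instead.

```python
import io
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
regex = re.compile("[%s]{5}.*" % AMINO_ACIDS)

def find_sequence(fh):
    """Return the text from the first run of 5 amino acid letters to the
    end of the input, or None if there is no such run."""
    # Drop line endings so a sequence wrapped over several lines becomes one string.
    s = "".join(line.rstrip("\r\n") for line in fh)
    match = regex.search(s)
    return match.group(0) if match else None

# Hypothetical file contents: a header line, then a sequence wrapped over two lines.
sample = io.StringIO("header text GN=NOTCH1 PE=1 SV=4\nMPPLLAPLLC\nLALLP\n")
print(find_sequence(sample))
```

Because the header and sequence lines are joined into one string, this relies on the header containing no run of five amino acid letters, which holds for the sample in the question but may not hold for every file.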
