Froot Loops

Reputation: 41

Python: find string in string

I want to print out the IDs that are in between ">sp|" and "|" from a file, so the output should be:

Q12955
Q16659
Q7Z7A1

Example file f:

>sp|Q12955|ANK3_HUMAN Ankyrin-3 OS=Homo sapiens GN=ANK3 PE=1 SV=3
MAHAASQLKKNRDLEINAEEEPEKKRKHRKRSRDRKKKSDANASYLRAARAGHLEKALDY
IKNGVDINICNQNGLNALHLASKEGHVEVVSELLQREANVDAATKKGNTALHIASLAGQA

>sp|Q16659|MK06_HUMAN Mitogen-activated protein kinase 6 OS=Homo sapiens GN=MAPK6 PE=1 SV=1

MAEKFESLMNIHGFDLGSRYMDLKPLGCGGNGLVFSAVDNDCDKRVAIKKIVLTDPQSVK
HALREIKIIRRLDHDNIVKVFEILGPSGSQLTDDVGSLTELNSVYIVQEYMETDLANVLE
QGPLLEEHARLFMYQLLRGLKYIHSANVLHRDLKPANLFINTEDLVLKIGDFGLARIMDP

>sp|Q7Z7A1|CNTRL_HUMAN Centriolin OS=Homo sapiens GN=CNTRL PE=1 SV=2

MKKGSQQKIFSKAKIPSSSHSPIPSSMSNMRSRSLSPLIGSETLPFHSGGQWCEQVEIAD
ENNMLLDYQDHKGADSHAGVRYITEALIKKLTKQDNLALIKSLNLSLSKDGGKKFKYIEN
LEKCVKLEVLNLSYNLIGKIEKLDKLLKLRELNLSYNKISKIEGIENMCNLQKLNLAGNE

My code:

f=open('seq.fasta','r')

for idline in f:
    ID = re.findall('|......|',idline)
    print ID
    break

Any help would be appreciated, thank you in advance!

Upvotes: 1

Views: 177

Answers (1)

heinst

Reputation: 8786

If the ID always sits between the first two vertical bars, you can split each header line on '|' and skip regular expressions entirely. (Judging by your example, it is safe to assume that it always does.)

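with open('seq.fasta', 'r') as f:
    for idline in f:
        # Header lines start with '>'; sequence lines are skipped
        if idline.startswith('>'):
            lineSplit = idline.split('|')
            ID = lineSplit[1]  # the second field is the ID
            print(ID)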

Output

Q12955
Q16659
Q7Z7A1
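
To see why index 1 is the right field, here is a small sketch of what split returns for one header line. The header string is copied from your example; the variable names db, accession and description are just illustrative labels I picked, not anything standard.

```python
header = ">sp|Q12955|ANK3_HUMAN Ankyrin-3 OS=Homo sapiens GN=ANK3 PE=1 SV=3"

# Splitting on '|' yields three fields for this header:
# the database tag, the ID, and the rest of the description.
parts = header.split('|')
db, accession, description = parts

print(db)         # the '>sp' tag
print(accession)  # the ID we are after
```

This assumes the description itself contains no further '|' characters, which holds for the three headers in your file.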

If the position does vary, you could instead loop over the sections and print each one that contains neither '>sp' nor 'OS=' (the code below filters on those two markers rather than checking for a leading Q). For your example, both versions give the same output.

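with open('seq.fasta', 'r') as f:
    for idline in f:
        # Header lines start with '>'; sequence lines are skipped
        if idline.startswith('>'):
            lineSplit = idline.split('|')
            for section in lineSplit:
                # Skip the '>sp' tag and the trailing description
                if 'OS=' not in section and '>sp' not in section:
                    ID = section
                    print(ID)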
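
If you do want to stay with regular expressions, the reason your pattern did not work is that a bare | means "or" in a regex, so '|......|' can match the empty string everywhere. Escaping the bars fixes that. Below is a minimal sketch, assuming every header starts with ">sp|"; the names ID_PATTERN and extract_ids are my own, and the sample list just stands in for the lines of your file.

```python
import re

# \| matches a literal vertical bar; [^|]+ captures everything up to the next bar.
ID_PATTERN = re.compile(r'^>sp\|([^|]+)\|')

def extract_ids(lines):
    """Return the ID from every line that starts with '>sp|'."""
    ids = []
    for line in lines:
        match = ID_PATTERN.match(line)
        if match:
            ids.append(match.group(1))
    return ids

# Stand-in for the lines of seq.fasta (taken from the example in the question)
sample = [
    ">sp|Q12955|ANK3_HUMAN Ankyrin-3 OS=Homo sapiens GN=ANK3 PE=1 SV=3\n",
    "MAHAASQLKKNRDLEINAEEEPEKKRKHRKRSRDRKKKSDANASYLRAARAGHLEKALDY\n",
    ">sp|Q16659|MK06_HUMAN Mitogen-activated protein kinase 6 OS=Homo sapiens GN=MAPK6 PE=1 SV=1\n",
    ">sp|Q7Z7A1|CNTRL_HUMAN Centriolin OS=Homo sapiens GN=CNTRL PE=1 SV=2\n",
]

for found_id in extract_ids(sample):
    print(found_id)
```

Because the pattern captures "anything that is not a bar", it does not depend on the ID being exactly six characters long, unlike the six dots in your original attempt. To run it on the real file, pass the open file object to extract_ids instead of sample.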

Upvotes: 1
