mitochondrion

Reputation: 61

Python 'for loop' to parse results

I am a beginning Python user (learning it for bioinformatics) and I am having trouble getting my final 'for loop' right. I used a web-based bioinformatics program to assess the subcellular localization of certain proteins (protein names and sequences are in ORFs), and I am trying to parse the results (in targetp). The web-based program truncates the protein names and does not include the sequences, so I would like to parse my results file so that I end up with the complete name and sequence of each protein in FASTA format (a '>' plus the protein name on one line, and the protein sequence on the next line). I think everything goes well until the last block of code: I end up with the proper protein names, but they are all paired with the same sequence. I know there must be something simple that I am doing wrong, but I just can't figure it out. Any ideas?

Thanks!

The ORFs file looks like this (it's FASTA, but the " shouldn't be there, only >):

">HsaNP_000700 branched chain keto acid dehydrogenase E1, alpha polypeptide MAVAIAAARVWRLNRGLSQAALLLLRQPGARGLARSHPPRQQQQFSSLDDKPQFPGASAEFIDKLEFIQPNVISGIPIYRVMDRQGQIINPSEDPHLPKEKVLKLYKSMTLLNTMDRILYESQRQGRISFYMTNYGEEGTHVGSAAALDNTDLVFGQYREAGVLMYRDYPLELFMAQCYGNISDLGKGRQMPVHYGCKERHFVTISSPLATQIPQAVGAAYAAKRANANRVVICYFGEGAASEGDAHAGFNFAATLECPIIFFCRNNGYAISTPTSEQYRGDGIAARGPGYGIMSIRVDGNDVFAVYNATKEARRRAVAENQPFLIEAMTYRIGHHSTSDDSSAYRSVDEVNYWDKQDHPISRLRHYLLSQGWWDEEQEKAWRKQSRRKVMEAFEQAERKPKPNPNLLFSDVYQEMPAQLRKQQESLARHLQTYGEHYPLDHFDK

">HsaNP_060914 pyruvate dehydrogenase phosphatase precursor MPAPTQLFFPLIRNCELSRIYGTACYCHHKHLCCSSSYIPQSRLRYTPHPAYATFCRPKENWWQYTQGRRYASTPQKFYLTPPQVNSILKANEYSFKVPEFDGKNVSSILGFDSNQLPANAPIEDRRSAATCLQTRGMLLGVFDGHAGCACSQAVSERLFYYIAVSLLPHETLLEIENAVESGRALLPILQWHKHPNDYFSKEASKLYFNSLRTYWQELIDLNTGESTDIDVKEALINAFKRLDNDISLEAQVGDPNSFLNYLVLRVAFSGATACVAHVDGVDLHVANTGDSRAMLGVQEEDGSWSAVTLSNDHNAQNERELERLKLEHPKSEAKSVVKQDRLLGLLMPFRAFGDVKFKWSIDLQKRVIESGPDQLNDNEYTKFIPPNYHTPPYLTAEPEVTYHRLRPQDKFLVLATDGLWETMHRQDVVRIVGEYLTGMHHQQPIAVGGYKVTLGQMHGLLTERRTKMSSVFEDQNAATHLIRHAVGNNEFGTVDHERLSKMLSLPEELARMYRDDITIIVVQFNSHVVGAYQNQE

The targetp file looks like this (the M is in position 57, but the formatting here throws this off):

HsaNP_000700 445 0.939 0.020 0.089 M 1
HsaNP_060914 537 0.309 0.073 0.629 _ 4

The leftmost column in targetp is the identifier (part of the header line in each protein sequence above), and I want to return only entries with an 'M' (i.e., not '_') in position 57, along with the protein name from ORFs (header line).

My script is:

#!/usr/bin/python

ORFs = open('Human.MitoCarta.fasta', 'U')
targetp = open('MitoCarta_TargetP_combined.out', 'U')
report = targetp.readlines()
protfile = open('mitocarta_no_mTP.fasta','w')
protid = []
seqdict = {}

for seq in ORFs:
    seq = seq.rstrip()
    if seq[0] == '':
        continue
    if seq[0] == '>':
        name = seq[1:]
        seqdict[name] = ''
        continue

    seqdict[name] += seq

for entry in report:
    if entry.startswith('HsaNP'):
        if entry[57] != 'M':
            protid.append(entry[0:20])
            protid = [x.strip(' ') for x in protid]


nameslist = seqdict.keys()
c = 0
for i in protid:
    if i in nameslist[c]:
        protfile.write('>%s\n%s\n\n' % (nameslist[c], seqdict[name]))
        c += 1

protfile.close()

Upvotes: 1

Views: 202

Answers (1)

JGallo

Reputation: 181

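Yes. You are writing nameslist[c] together with seqdict[name], but 'name' is never updated in that last loop: it still holds the last header read by the first loop, so every record gets that same sequence. To get the different sequences, look each one up with the same key you use for the header. You should write: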

protfile.write('>%s\n%s\n\n' % (nameslist[c], seqdict[nameslist[c]]))

That way each header is written together with its own sequence.
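Beyond that one-line fix, the last loop may still skip records: it compares each identifier only against the key at position c, c advances only on a match, and in Python 2 the order of seqdict.keys() is not guaranteed. Below is a minimal sketch of an alternative that looks headers up by identifier instead of by position. It is written for Python 3 and is not the original script: the function names (read_fasta, select_ids, to_fasta) and the keep_m flag are hypothetical, the sequences are shortened stand-ins for the ones in the question, and it assumes the targetp columns are whitespace-separated with the location letter in the sixth column, and that the truncated names still contain the whole identifier (the first word of the FASTA header).

```python
def read_fasta(lines):
    """Map each full header (without '>') to its concatenated sequence."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank line: nothing to index, so skip it
        if line.startswith('>'):
            name = line[1:]
            records[name] = ''
        elif name is not None:
            records[name] += line
    return records


def select_ids(report_lines, keep_m=True, prefix='HsaNP'):
    """Collect identifiers by the location column of the targetp report.

    Assumed column layout: id, length, three scores, location, class.
    keep_m=True keeps 'M' entries; keep_m=False keeps everything else.
    """
    ids = []
    for line in report_lines:
        fields = line.split()
        if len(fields) < 7 or not fields[0].startswith(prefix):
            continue  # header, separator, or otherwise unrelated line
        if (fields[5] == 'M') == keep_m:
            ids.append(fields[0])
    return ids


def to_fasta(records, ids):
    """Build FASTA text for the selected identifiers."""
    # Index full headers by their first word (the identifier).
    by_id = {h.split()[0]: h for h in records if h.split()}
    chunks = []
    for ident in ids:
        header = by_id.get(ident)
        if header is not None:
            # Header and sequence come from the same key.
            chunks.append('>%s\n%s\n' % (header, records[header]))
    return ''.join(chunks)


# Shortened stand-in data based on the two entries in the question.
fasta_lines = [
    '>HsaNP_000700 branched chain keto acid dehydrogenase E1, alpha polypeptide',
    'MAVAIAAARV',
    'WRLNRGLSQA',
    '',
    '>HsaNP_060914 pyruvate dehydrogenase phosphatase precursor',
    'MPAPTQLFFP',
]
report_lines = [
    'HsaNP_000700 445 0.939 0.020 0.089 M 1',
    'HsaNP_060914 537 0.309 0.073 0.629 _ 4',
]

records = read_fasta(fasta_lines)
no_mtp = select_ids(report_lines, keep_m=False)  # mirrors the script's != 'M'
print(to_fasta(records, no_mtp), end='')
```

The question text asks for the 'M' entries while the script keeps the non-'M' ones, so the keep_m flag leaves that choice explicit. With real files, the two lists would be replaced by the open file objects, and the result of to_fasta written to the output file.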

Upvotes: 1
