Reputation: 141
I have this file containing 82 pairs of IDs:
EmuJ_000063620.1 EgrG_000063620.1 253 253
EmuJ_000065200.1 EgrG_000065200.1 128 128
EmuJ_000081200.1 EgrG_000081200.1 1213 1213
EmuJ_000096200.1 EgrG_000096200.1 295 298
EmuJ_000114700.1 EgrG_000114700.1 153 153
EmuJ_000133800.1 EgrG_000133800.1 153 153
EmuJ_000139900.1 EgrG_000144400.1 2937 2937
EmuJ_000164600.1 EgrG_000164600.1 167 167
and I have two other files with the sequences for EmuJ_*
IDs and EgrG_*
IDs as follows:
EgrG_sequences.fasta
:
>EgrG_000632500.1
MKKKSHRKSPEGNHSLTKAANKDTAKCNEERGRNIGQSNEEENATRSEKDREGDEDRNLREYVISIAQKYYPHLVSCMRQDDDNQASADARGADGANDEEHCPKHCPRLNAQKYYLYSATCNHHCEDSQASCDEEGDGKRLLKQCLLWLTERYYPSLAARIRQCNDDQASSNAHGADETDDGDRRLKQALLLFAKKLYPCVTTCIRHCVADHTSHDARGVDEEVDGEQLLKQCLHSSAQKFYPRLAACVCHCDADHASTETCGALGVGNAERCPQQCPCLCAQQYYVQSATCVHHCDNEQSSPETRGVKEDVDVEQLLKQCLLMFAEKFHPTLAAGIRSCADDESSHVASVEGEDDADKQRLKQYLLLFAQKYYPHLIAYIQKRDDDQSSSSVRDSGEEANEEEERLKQCLLLFAQKLYPRLVAYTGRCDSNQSTSDGCSVDGEEAEKHYLKQSLLLLAQKYYPSLAAYLRQFDDNQSSSDVRSVDEEEAEKRHLKQGLLFFAEKYYPSLATYIRRCDDDQSSSDARVVDEVDDEDRRLKQGLLLLAQKYYPPLANYIRHSQSSFNVCGADEKEDEEHCLNQLPRLCAQEAYIRSSSCSHHCDDDQASNDTLVVDKEEEEKYRLKQGLLLLAQKFYPPLATCIHQCDDQSSHDTRGVDEEEAEEQLLKKCLLMFAEKFYPSLAATIHHSVYDQASFDMRDVDTENDETHCLSLSAENYSTASTTCIHHSDGDQSTSDACGVEEGDVEEQRLKRGLLLLAQKYYPSLAAYICQCDDYQPSSDVCGVGEEDTGEERLKQCLLLFAKKFYPSLASRNSQCGDNLILNDEVVGETVINSDTDTDEVTPVEKSTAVCDEVDEVPFKYVGSPTPLSDVDVDSLEKVIPPNDLTAHSSFQNSLDHSVEGGYPDRAFYIGRHTVESADSTAPLSKSSSTKLYFSNTDEFPTEEEVSSPIAPLSIQRRIRIYLEDLENVRKVSLIPLCKTDKFGNPQEEIIIDSNLDDDTDESKLSSVDVEFTMEQADATPLDLEAQDEDLKNCVAIILKHIWSELMECIRREGLSDVYELSLGDRRIEVPQDDVCLVR*
>EgrG_000006700.1
MTDTKGPDESYFEKEAFSSLPQPVDSPSASATDTDRIPVVAVSLPVSSGSIDVNCNCSCYLIICETKLIIDYQMTRKW*
and so on. The same for EmuJ_sequences.fasta
I need to get the sequences for each pair and write one after the other maintaining the order like this:
>EmuJ_000063620.1
AEPGSGDFDANALRDLANEHQRRVQQKQADLETYELQVLDSVLELTSQLSLNLNEKISKAYENQCRLDTEVKRLCSNIQTFNRQVDMWNKEILDINSALKELGDAETWSQKLCRDVQIIHDTLQAADK*
>EgrG_000063620.1
AEPGSGDFDANALRDLANEHQRRVQQKQADLETYELQVLDSVLELTSQLSLNLNEKISKAYDNQCRLDTEVKRLCSNIQTFNCQVDLWNKEILDINSALKELGDAETWSQKLCRDVQIIHDTLQAADK*
>EmuJ_000065200.1
MLCLITPFPSVVPVCVRTCVCMCPCPLLLILYTWSAYLVPFSLPLCLYAHFHIRFLPPFSSLSIPRFLTHSLFLPSYPPLTMLRMKKSLAPCPAERR*
>EgrG_000065200.1
MLCLVTSFPSAVPVCMRTCVCMCSCPLLLILYTWSAYLVPFSLPLCLYTHLHIRFLPPFPSLAIPRFLTHPLFLPTSLYVADKKEPSAMPRRASLRQMLLIVLLQELH*
>EmuJ_000081200.1
MNSLRIFAVVITCLMVVGFSYSIHPTFPSYQSVVWHSSANTGYECRDGICGYRCSNPWCHGFGSILHPQMGVQEMWGSAAHGRHAHSRAMTEFLAKASPEDVTMLIESTPNIDEVITSLDGEAVTILINKLPNLRRVMEELKPQTKMHIVSKLCGKVGSAMEWTEARRNDGSGMWNEYGSGWEGIDAIQDLEAEVIMRCVQDCGYCAHPTMDGGYVFDPIPIKDVAVYDDSMNWQPQLPTPATSVSSMDPLVLRSIILNMPNLNDILMQVDPVYLQSALVHVPGFGAYASSMDAYTLHSMIVGLPYVRDIVASMDARLLQRMIAHIPNIDAILFGGNAVISQPTMPDMPRKAPRAEEPDAKTTEVAGGMSDEANIMDRKFMEYIISTMPNVPTRFANVLLHVKPDYVRYIIEKHGNLHGLLAKMNAQTLQYVIAHVPKFGVILSNMNRNTLKVVFDKLPNIAKFLADMNPRVVRAIVAKLPSLAKYTPTDPTTTALPTSVTLVPELGTEFSSYAATASATEEPTVTVDYANLLRSKIPLIDNVIKMSDPEKVAILRDNLLDVSRILVNLDPTMLRNINSIIFNATKMLNELSVFLVEYPLEYLHKEGKSGVAVNKSEQVGTTGENGVSSIAVEKLQMVLLKIPLFDQFLKWIDQKKLHELLNKIPTLLEVIATANQETLDKINSLLHDAIATMNTAKKLIVTGICRKLAEEGKLRLPRVCPSAST*
>EgrG_000081200.1
MNLLRIFAVVITCLIVVGFGYPTHPTFPSYQTAVWHSSANTGYRCRAGICGYRCSSPWCHGFESALYPQMAMQEMWGSGAHGRHAHSRTMTEFLMKASPEDLTMLIESTPNIDEVITSLDSEAIIILINKLPNLRRVMEKLKPQTKMHIVSKLCDKVGNAMEWAGARRNDGSGMWNEYGSVWEGIDAIQDLEAEMITRCVQDCGYCAHPTMDGGYVFDPIPIKDVAVYDDSMNWQPQLPMPATLVSNMDPHVLRSIILNMPNLDDILMQVDPVHLQSALMYVPGFGTYASSMDAYTLHSMIVGLPYVRDIVASMDARLLQWMIAHIPNIDAILFGGNAVISQPTMPDMPRKAPKAEEPDAKTTEVAGGMSDEANIMDRKFMEYIISTMPNVPARFANVLLHVKPDYVRYIIENHGNLHGLLAKMNAQTLQYVIAHVPKFGVILSNMNRNTLKVVFDKLPNIAKFLADMNPNVVRAIVAKLPSLAKYTPTDPTTTALPTSVTLVPELGTEFSSYAPTASVTEASMVTVDYAHLLRSKIPLIDNVIKMSDPAKVAILRDNLLDVGTTDENGVSSITVEKLQMVLLKIPLFDQFLNWIDSKKLHALLQKIPTLLEVIATANQEALDKINLLLHDAIATMNTAKKLIVTSICRKLAEEGKLRLPRVCPSTST*
And so on.
I wrote a script in bash to do this and it worked like I wanted, it was very simple. Now I'm trying to do the same in Python (which I'm learning), but I'm having a hard time to do the same in a pythonic way. I've tried this, but I've got only the first pair and then it stopped:
rbh=open('rbh_res_eg-not-sec.txt', 'r')
ems=open('em_seq.fasta', 'r')
egs=open('eg_seq.fasta', 'r')
for l in rbh:
emid=l.split('\t')[0]
egid=l.split('\t')[1]
# ids=emid+'\n'+egid
# print ids # just to check if split worked
for lm in ems:
if emid in lm:
print lm.strip()
print next(ems).strip()
for lg in egs:
if egid in lg:
print lg.strip()
print next(egs).strip()
I've tried some variations but I've got only the IDs, without the sequences. So, how can I find the ID in the sequence file, print it and the line after it (the line with sequence referring to the ID)? Please, let me know if I explained it clearly.
Upvotes: 1
Views: 438
Reputation: 77892
Iterating over a file moves the file pointer until it reaches the end of the file (the last line), so after the first iteration of your outer loop, the ems
and egs
files are exhausted.
The quick&dirty workaround would be to reset the ems
and egs
pointers to zero at the end of the outer loop, ie:
for line in rbh:
# no need to split twice
parts = line.split("\t")
emid, egid = parts[0].strip(), parts[1].strip()
for lm in ems:
if emid in lm:
print lm.strip()
print next(ems).strip()
ems.seek(0) # reset the file pointer
for lg in egs:
if egid in lg:
print lg.strip()
print next(egs).strip()
egs.seek(0) # reset the file pointer
Note that calling next(iterator)
while already iterating over iterator
will consume one more item of the iterator, as illustrated here:
>>> it = iter(range(20))
>>> for x in it:
... print x, next(it)
...
0 1
2 3
4 5
6 7
8 9
10 11
12 13
14 15
16 17
18 19
As you can see, we don't iter on each element of our range here... Given your file format it should not be a huge problem but I thought I'd still warn you about it.
Now your algorithm is far from efficient - for each line of the rbh
file it will read scan the whole ems
and egs
files again and again.
_NB : the following assumes that each emid / egid will appear at most once in the fasta files._
If your ems
and egs
files are not too large and you have enough available memory, you could load them into a pair of dicts and do a mere dict lookup (which is O(1) and possibly one of the most optimized operation in Python)
# warning: totally untested code
def fastamap(path):
d = dict()
with open(path) as f:
for num, line in enumerate(f, 1):
line = line.strip()
# skip empty lines.
if not line:
continue
# sanity check: we should only see
# lines starting with ">", the "value"
# lines being consumed by the `next(f)` call
if not line.startswith(">"):
raise ValueError(
"in file %s: line %s doesn't start with '>'" % (
path, num
))
# ok, proceed
d[line.lstrip(">")] = next(f).strip()
return d
ems = fastamap('em_seq.fasta')
egs = fastamap('eg_seq.fasta')
with open('rbh_res_eg-not-sec.txt') as rhb:
for line in rhb:
parts = line.split("\t")
emid, egid = parts[0].strip(), parts[1].strip()
if emid in ems:
print emid
print ems[emid]
if egid in egs:
print egid
print egs[egid]
If this doesn't fly because of memory issues, well bad luck you're stuck with sequential scan (unless you want to use some database system but this might be a bit overkill), but - always assuming the emid/egid each only appears once in the fasta files - you can at least exit the inner loops once you've find your target:
for l in rbh:
# no need to split twice, you can just unpack
emid, egid = l.split('\t')
for lm in ems:
if emid in lm:
print lm.strip()
print next(ems).strip()
break # no need to go further
ems.seek(0) # reset the file pointer
# etc...
Upvotes: 1