Reputation: 13
I have two protein sequences FASTA files:
nsp.fasta --> original file
wsp.fasta --> output file from a signal peptide predictor tool, which returned the proteins in nsp.fasta with the signal stripped.
For example:
record in nsp.fasta:
>gi|564250271|ref|XP_006264203.1| PREDICTED: apolipoprotein D [Alligator mississippiensis] MRGMLALLAALLGLLGLVEGQTFHMGQCPNPPVQEDFDPSKYLGKWYEIEKLPSGFEQER CVQANYSLKANGKIKVLTKMVRSAQHLTCLQHRMMLLVSSPVMPASPYWVVATDYENYAL VYSCTSFFWLFHVDYAWIRSRTPQLHPETVEHLKSVLRSYRIQTGMMLPTDQMNCPSDM
record in wsp.fasta:
>gi|564250271|ref|XP|006264203.1| PREDICTED: apolipoprotein D [Alligator mississippiensis]; MatureChain: 21-179 QTFHMGQCPNPPVQEDFDPSKYLGKWYEIEKLPSGFEQERCVQANYSLKANGKIKVLTKM VRSAQHLTCLQHRMMLLVSSPVMPASPYWVVATDYENYALVYSCTSFFWLFHVDYAWIRS RTPQLHPETVEHLKSVLRSYRIQTGMMLPTDQMNCPSDM
However, not all the proteins in nsp.fasta contained a signal peptide, so wsp.fasta is a subset of the proteins in nsp.fasta that contains the signal. What I need is a unique file that contains all the protein records, both proteins with no signal peptide found and the mature chains with the signal peptide stripped.
I have tried the following:
from Bio import SeqIO
file1 = SeqIO.parse(r"c:\Users\Sergio\Desktop\nsp.fasta", "fasta")
file2 = SeqIO.parse(r"c:\Users\Sergio\Desktop\wsp.fasta", "fasta")
for seq1 in file1:
for seq2 in file2:
if seq2.id == seq1.id:
seq1.seq = seq2.seq
SeqIO.write(seq1, r"c:\Users\Sergio\Desktop\nuevsp.fasta", "fasta")
But there's no output at all. I have tried putting the SeqIO.write out of the loops, and it returns a blank file. What am I doing wrong? There already exist any method to merge two files or to replace sequences in one file with sequences in other file?
Thank you in advance!!
Sergio
Edited code, I added an elif clause in an attempt to also add the records in nsp.fasta that doesn't match wsp.fasta, but it doesn't work:
to_write = []
for seq1 in SeqIO.parse(r"c:\Users\Sergio\Desktop\nsp.txt", "fasta"):
for seq2 in SeqIO.parse(r"c:\Users\Sergio\Desktop\wsp.txt", "fasta"):
if seq1.id == seq2.id:
seq1.seq = seq2.seq
to_write.append(seq1)
elif seq1.id != seq2.id:
to_write.append(seq1)
SeqIO.write(to_write, r"c:\Users\Sergio\Desktop\nuevsp.txt", "fasta")
Upvotes: 1
Views: 432
Reputation: 6699
As you have written it, every time you write a new sequence, you're overwriting the previous one. Try storing your records in a list and then writing out the list when the loop is completed.
to_write = []
for seq1 in SeqIO.parse(r"c:\Users\Sergio\Desktop\nsp.fasta", "fasta"):
for seq2 in SeqIO.parse(r"c:\Users\Sergio\Desktop\wsp.fasta", "fasta"):
if seq2.id == seq1.id:
seq1.seq = seq2.seq
to_write.append(seq1)
SeqIO.write(to_write, r"c:\Users\Sergio\Desktop\nuevsp.fasta", "fasta")
Edit to suggest another approach using list comprehensions:
ids_to_save = [x.id for x in SeqIO.parse(r"c:\Users\Sergio\Desktop\nsp.fasta", "fasta")]
records_to_save = [x for x in SeqIO.parse(r"c:\Users\Sergio\Desktop\wsp.fasta", "fasta") if (x.id in ids_to_save)]
SeqIO.write(records_to_save, r"c:\Users\Sergio\Desktop\nuevsp.fasta", "fasta")
Edit to address the "add the records in nsp.fasta that doesn't match wsp.fasta" need - general approach, not necessarily exact code:
ids_not_wanted = [x.id for x in SeqIO.parse(r"c:\Users\Sergio\Desktop\wsp.fasta", "fasta")]
records_to_save_2 = [x for x in SeqIO.parse(r"c:\Users\Sergio\Desktop\wsp.fasta", "fasta") if (x.id not in ids_not_wanted)]
records_to_save.append(records_to_save_2)
# If duplicate records are a problem, eliminate them using "set"
records_to_save = list(set(records_to_save))
SeqIO.write(records_to_save, r"c:\Users\Sergio\Desktop\nuevsp.fasta", "fasta")
Upvotes: 1