Reputation: 35
Hellow everybody, I am new to python struggling to do a small task using biopython.I have two file- one containing list of ids and associated number.eg
id.txt
tr_F6LMO6_F6LMO6_9LE 25
tr_F6ISE0_F6ISE0_9LE 17
tr_F6HSF4_F6HSF4_9LE 27
tr_F6PLK9_F6PLK9_9LE 19
tr_F6HOT8_F6HOT8_9LE 29
Second file containg a large fasta sequences.eg below
fasta_db.fasta
>tr|F6LMO6|F6LMO6_9LEHG Transporter
MLAPETRRKRLFSLIFLCTILTTRDLLSVGIFQPSHNARYGGMGGTNLAIGGSPMDIGTN
PANLGLSSKKELEFGVSLPYIRSVYTDKLQDPDPNLAYTNSQNYNVLAPLPYIAIRIPIT
EKLTYGGGVYVPGGGNGNVSELNRATPNGQTFQNWSGLNISGPIGDSRRIKESYSSTFYV
>tr|F6ISE0|F6ISE0_9LEHG peptidase domain protein OMat str.
MPILKVAFVSFVLLVFSLPSFAEEKTDFDGVRKAVVQIKVYSQAINPYSPWTTDGVRASS
GTGFLIGKKRILTNAHVVSNAKFIQVQRYNQTEWYRVKILFIAHDCDLAILEAEDGQFYK
>tr|F6HSF4|F6HSF4_9LEHG hypothetical protein,
MNLRSYIREIQVGLLCILVFLMSLYLLYFESKSRGASVKEILGNVSFRYKTAQRKFPDRM
LWEDLEQGMSVFDKDSVRTDEASEAVVHLNSGTQIELDPQSMVVLQLKENREILHLGEGS
>tr|F6PLK9|F6PLK9_9LEHG Uncharacterized protein mano str.
MRKITGSYSKISLLTLLFLIGFTVLQSETNSFSLSSFTLRDLRLQKSESGNNFIELSPRD
RKQGGELFFDFEEDEASNLQDKTGGYRVLSSSYLVDSAQAHTGKRSARFAGKRSGIKISG
I wanted to match the id from the first file with second file and print those matched seq in a new file after removing the length(from 1 to 25, in eq) .
Eg output[ 25(associated value with id,first file), aa removed from start, when id matched].
fasta_pruned.fasta
>tr|F6LMO6|F6LMO6_9LEHG Transporter
LLSVGIFQPSHNARYGGMGGTNLAIGGSPMDIGTNPANLGLSSKKELEFGVSL
PYIRSVYTDKLQDPDPNLAYTNSQNYNVLAPLPYIAIRIPITEKLTYGGGVYV
PGGGNGNVSELNRATPNGQTFQNWSGLNISGPIGDSRRIKESYSSTFYV
Biopython cookbook was way above my head being new to python programming.Thanks for any help you can give.
I tried and messed up. Here is it.
from Bio import SeqIO
from Bio import Seq
f1 = open('fasta_pruned.fasta','w')
lengthdict = dict()
with open("seqid_len.txt") as seqlengths:
for line in seqlengths:
split_IDlength = line.strip().split(' ')
lengthdict[split_IDlength[0]] = split_IDlength[1]
with open("species.fasta","rU") as spe:
for record in SeqIO.parse(spe,"fasta"):
if record[0] == '>' :
split_header = line.split('|')
accession_ID = split_header[1]
if accession_ID in lengthdict:
f1.write(str(seq_record.id) + "\n")
f1.write(str(seq_record_seq[split_IDlength[1]-1:]))
f1.close()
Upvotes: 2
Views: 1353
Reputation: 31709
Your code has almost everything except for a couple of small things which prevent it from giving the desired output:
id.txt
has two spaces between the id and the location. You take the 2nd element which would be empty in this case.When the file is read it is interpreted as a string but you want the position to be an integer
lengthdict[split_IDlength[0]] = int(split_IDlength[-1])
Your ids are very similar but not identical, the only identical part is the 6 character identifier which could be used to map the two files (double check that before you assume it works). Having identical keys makes mapping much easier.
f1 = open('fasta_pruned.fasta', 'w')
fasta = dict()
with open("species.fasta","rU") as spe:
for record in SeqIO.parse(spe, "fasta"):
fasta[record.id.split('|')[1]] = record
lengthdict = dict()
with open("seqid_len.txt") as seqlengths:
for line in seqlengths:
split_IDlength = line.strip().split(' ')
lengthdict[split_IDlength[0].split('_')[1]] = int(split_IDlength[1])
for k, v in lengthdict.items():
if fasta.get(k) is None:
continue
print('>' + k)
print(fasta[k].seq[v:])
f1.write('>{}\n'.format(k))
f1.write(str(fasta[k].seq[v:]) + '\n')
f1.close()
Output:
>F6LMO6 LLSVGIFQPSHNARYGGMGGTNLAIGGSPMDIGTNPANLGLSSKKELEFGVSLPYIRSVYTDKLQDPDPNLAYTNSQNYNVLAPLPYIAIRIPITEKLTYGGGVYVPGGGNGNVSELNRATPNGQTFQNWSGLNISGPIGDSRRIKESYSSTFYV >F6ISE0 LPSFAEEKTDFDGVRKAVVQIKVYSQAINPYSPWTTDGVRASSGTGFLIGKKRILTNAHVVSNAKFIQVQRYNQTEWYRVKILFIAHDCDLAILEAEDGQFYK >F6HSF4 YFESKSRGASVKEILGNVSFRYKTAQRKFPDRMLWEDLEQGMSVFDKDSVRTDEASEAVVHLNSGTQIELDPQSMVVLQLKENREILHLGEGS >F6PLK9 IGFTVLQSETNSFSLSSFTLRDLRLQKSESGNNFIELSPRDRKQGGELFFDFEEDEASNLQDKTGGYRVLSSSYLVDSAQAHTGKRSARFAGKRSGIKISG >F6HOT8
Upvotes: 2