Zero
Zero

Reputation: 35

How to extract and trim the fasta sequence using biopython

Hellow everybody, I am new to python struggling to do a small task using biopython.I have two file- one containing list of ids and associated number.eg

id.txt

tr_F6LMO6_F6LMO6_9LE  25
tr_F6ISE0_F6ISE0_9LE  17
tr_F6HSF4_F6HSF4_9LE  27
tr_F6PLK9_F6PLK9_9LE  19
tr_F6HOT8_F6HOT8_9LE  29

Second file containg a large fasta sequences.eg below

fasta_db.fasta

    >tr|F6LMO6|F6LMO6_9LEHG Transporter
    MLAPETRRKRLFSLIFLCTILTTRDLLSVGIFQPSHNARYGGMGGTNLAIGGSPMDIGTN
    PANLGLSSKKELEFGVSLPYIRSVYTDKLQDPDPNLAYTNSQNYNVLAPLPYIAIRIPIT
    EKLTYGGGVYVPGGGNGNVSELNRATPNGQTFQNWSGLNISGPIGDSRRIKESYSSTFYV

   >tr|F6ISE0|F6ISE0_9LEHG peptidase domain protein OMat str.  
    MPILKVAFVSFVLLVFSLPSFAEEKTDFDGVRKAVVQIKVYSQAINPYSPWTTDGVRASS
    GTGFLIGKKRILTNAHVVSNAKFIQVQRYNQTEWYRVKILFIAHDCDLAILEAEDGQFYK

   >tr|F6HSF4|F6HSF4_9LEHG hypothetical protein,  
    MNLRSYIREIQVGLLCILVFLMSLYLLYFESKSRGASVKEILGNVSFRYKTAQRKFPDRM
    LWEDLEQGMSVFDKDSVRTDEASEAVVHLNSGTQIELDPQSMVVLQLKENREILHLGEGS


   >tr|F6PLK9|F6PLK9_9LEHG Uncharacterized protein mano str. 
   MRKITGSYSKISLLTLLFLIGFTVLQSETNSFSLSSFTLRDLRLQKSESGNNFIELSPRD
   RKQGGELFFDFEEDEASNLQDKTGGYRVLSSSYLVDSAQAHTGKRSARFAGKRSGIKISG

I wanted to match the id from the first file with second file and print those matched seq in a new file after removing the length(from 1 to 25, in eq) .

Eg output[ 25(associated value with id,first file), aa removed from start, when id matched].

fasta_pruned.fasta

>tr|F6LMO6|F6LMO6_9LEHG Transporter     
LLSVGIFQPSHNARYGGMGGTNLAIGGSPMDIGTNPANLGLSSKKELEFGVSL
PYIRSVYTDKLQDPDPNLAYTNSQNYNVLAPLPYIAIRIPITEKLTYGGGVYV
PGGGNGNVSELNRATPNGQTFQNWSGLNISGPIGDSRRIKESYSSTFYV

Biopython cookbook was way above my head being new to python programming.Thanks for any help you can give.

I tried and messed up. Here is it.

from Bio import SeqIO
from Bio import Seq

f1 = open('fasta_pruned.fasta','w')


lengthdict = dict() 
with open("seqid_len.txt") as seqlengths:
    for line in seqlengths:
        split_IDlength  = line.strip().split(' ')
        lengthdict[split_IDlength[0]] = split_IDlength[1]


with open("species.fasta","rU") as spe:
    for record in SeqIO.parse(spe,"fasta"):
        if record[0] == '>' :
            split_header = line.split('|')
            accession_ID = split_header[1]
            if accession_ID in lengthdict:
                f1.write(str(seq_record.id) + "\n")
                f1.write(str(seq_record_seq[split_IDlength[1]-1:]))



f1.close()

Upvotes: 2

Views: 1353

Answers (1)

Maximilian Peters
Maximilian Peters

Reputation: 31709

Your code has almost everything except for a couple of small things which prevent it from giving the desired output:

  • Your file id.txt has two spaces between the id and the location. You take the 2nd element which would be empty in this case.
  • When the file is read it is interpreted as a string but you want the position to be an integer

    lengthdict[split_IDlength[0]] = int(split_IDlength[-1])
    
  • Your ids are very similar but not identical, the only identical part is the 6 character identifier which could be used to map the two files (double check that before you assume it works). Having identical keys makes mapping much easier.

    f1 = open('fasta_pruned.fasta', 'w')
    
    fasta = dict()
    with open("species.fasta","rU") as spe:
        for record in SeqIO.parse(spe, "fasta"):
            fasta[record.id.split('|')[1]] = record
    
    lengthdict = dict() 
    with open("seqid_len.txt") as seqlengths:
        for line in seqlengths:
            split_IDlength  = line.strip().split(' ')
            lengthdict[split_IDlength[0].split('_')[1]] = int(split_IDlength[1])
    
    for k, v in lengthdict.items():
        if fasta.get(k) is None:
            continue
        print('>' + k)
        print(fasta[k].seq[v:])
        f1.write('>{}\n'.format(k))
        f1.write(str(fasta[k].seq[v:]) + '\n')
    
    f1.close()
    

Output:

>F6LMO6
LLSVGIFQPSHNARYGGMGGTNLAIGGSPMDIGTNPANLGLSSKKELEFGVSLPYIRSVYTDKLQDPDPNLAYTNSQNYNVLAPLPYIAIRIPITEKLTYGGGVYVPGGGNGNVSELNRATPNGQTFQNWSGLNISGPIGDSRRIKESYSSTFYV
>F6ISE0
LPSFAEEKTDFDGVRKAVVQIKVYSQAINPYSPWTTDGVRASSGTGFLIGKKRILTNAHVVSNAKFIQVQRYNQTEWYRVKILFIAHDCDLAILEAEDGQFYK
>F6HSF4
YFESKSRGASVKEILGNVSFRYKTAQRKFPDRMLWEDLEQGMSVFDKDSVRTDEASEAVVHLNSGTQIELDPQSMVVLQLKENREILHLGEGS
>F6PLK9
IGFTVLQSETNSFSLSSFTLRDLRLQKSESGNNFIELSPRDRKQGGELFFDFEEDEASNLQDKTGGYRVLSSSYLVDSAQAHTGKRSARFAGKRSGIKISG
>F6HOT8

Upvotes: 2

Related Questions