hari
hari

Reputation: 61

using Python to retrieve missing sequences -'split' command does not work

I have a set of (protein)sequences that has been found using a software but they are shorter in length than that of the original ones in the database.I downloaded the entire database ,and now i have these set of incomplete sequences that have been found and the original database from which the sequences have been found.

Example result from software:

>tr|E7EWP2|E7EWP2_HUMAN  Uncharacterized protein OS=Homo sapiens GN=TRIO PE=4 SV=2
KEFIMAELIQTEKAYVRDLRECMDTYLWEMTSGVE

Sequence in the database:

>tr|E7EWP2|E7EWP2_HUMAN  Uncharacterized protein OS=Homo sapiens GN=TRIO PE=4 SV=2
ARRKEFIMAELIQTEKAYVRDLRECMDTYLWEMTSGVEEIP

So the missing residues are 'ARR' and in the end 'EIP', I have around 70 incomplete sequences like this? I would like to write a Python program that can automatically retrieve the complete sequences from the database. I am really new to python ,ofcourse i will try to write my own code ,i would like to know if there are any libraries or something like biopython modules that can do this. My plan is to take the intervals from my result,expand them and select it on the original database,but i do not know how to proceed further.

i would like to get list_seq = [ARR,KEFIMAELIQTEKAYVRDLRECMDTYLWEMTSGVE,EIP] so that i can further use list_seq[0] r.strip(3) and list_seq[1] l.strip[3] so that i get the complete sequence. but list_seq does not work.

Thanks in advance

Upvotes: 1

Views: 235

Answers (1)

joaquin
joaquin

Reputation: 85603

You can use the index method from BioPython SeqIO. The index method indexes database records by protein id and doesn't load the full database in memory thus allowing efficient search with complete, big databases (alternatively you could use a conventional dbase like sqlite to first store your records and perform searches on it):

from Bio import SeqIO

db1 = "dbase.fasta"
db2 = "my_collection.fasta"

dbase_dict = SeqIO.index(db1, "fasta")
my_record_dict = SeqIO.index(db2, "fasta")

for record in my_record_dict:
    if record in dbase_dict:
        rec_dbase = dbase_dict[record]
        rec_mine = my_record_dict[record]
        if rec_dbase.seq != rec_mine.seq:
            print rec_dbase

This program just print the records with differences. From this point you can save them on a list or write to a fasta file

Upvotes: 4

Related Questions