KoSi
KoSi

Reputation: 31

Retrieving RNA UTR sequences from a database

I am trying to retrieve particular RNA UTR sequences from a database. From the data base I got a .dat file in which each RNA UTR is represented by an entry like this:

ID   5MMUR018955; SV 1; linear; mRNA; STD; MUS; 54 BP.
XX
AC   BR058092;
XX
DT   01-JUL-2009 (Rel. 9, Created)
DT   01-JUL-2009 (Rel. 9, Last updated, Version 1)
XX
DE   5'UTR in Mus musculus neutrophilic granule protein (Ngp), mRNA.
XX
DR   REFSEQ; NM_008694;
DR   UTRef; CR062409;
DR   GeneID; 18054;
XX
OS   Mus musculus (house mouse)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Euarchontoglires; Glires; Rodentia; Sciurognathi; Muroidea;
OC   Muridae; Murinae; Mus; Mus.
XX
UT   5'UTR; 1 exon(s)
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..54
FT                   /organism="Mus musculus"
FT                   /mol_type="mRNA"
FT                   /strain="C57BL/6"
FT                   /db_xref="taxon:10090"
FT   5'UTR           1..54
FT                   /source="REFSEQ::NM_008694:1..54"
FT                   /gene="Ngp"
FT                   /product="neutrophilic granule protein"
FT                   /gene_synonym="bectenecin"
FT                   /genome="chr9:110322312-110322365:+"
XX
SQ   Sequence 54 BP; 19 A; 9 C; 14 G; 12 T; 0 other;
     agtctcaata tcatctacat aaaaggggcc aagagtggta gtgtgtcaga gaca              54
//

I have a list of gene names (stored in the line FT /gene="Ngp"), which I want to use to access the sequence stored in the line SQ Sequence 54 BP; 19 A; 9 C; 14 G; 12 T; 0 other; agtctcaata tcatctacat aaaaggggcc aagagtggta gtgtgtcaga gaca 54

Following retrieval, I would like to write both into a new file in fasta format, i.e.

>Ngp
agtctcaatatcatctacataaaaggggccaagagtggtagtgtgtcagagaca"

Is there any simple way to do this in python? I have been fighting with it all day and haven't really gotten anywhere and would really appreciate your help.

Upvotes: 1

Views: 758

Answers (3)

nwod
nwod

Reputation: 318

Here's a robust solution that searches the database file for a list of genes, prints the results in fasta format and lists the genes that are not found.

Note that there may be multiple records of sequences for the same gene name in the database so you might need additional filtering to get exactly those sequences that you wish to obtain.

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

data = "embl.dat"  #Path to EMBL database file
search = "gene_names.txt" #Path to file with search terms

#Load the search terms from file and strip linefeed characters
search_genes = open(search, 'r').read().splitlines() 
found_genes = []

#Search the EMBL database file
for record in SeqIO.parse(open(data, 'r'), 'embl'):
    UTR5 = [feature for feature in record.features if feature.type=="5'UTR"]
    for utr5feature in UTR5:
        for s in search_genes:
            genes = utr5feature.qualifiers['gene']
            if s in genes:
                found_genes.append(s)
                #Gene found. Print a modified copy of the record in the desired format
                print SeqRecord(record.seq, id="_".join(genes), name=record.name,
                                description=record.description).format('fasta')

#List any search terms that were not found in the database
for s in search_genes:
    if s not in found_genes:
        print s+" NOT FOUND IN DATABASE!"

Upvotes: 1

Jose Ricardo Bustos M.
Jose Ricardo Bustos M.

Reputation: 8164

i parse the embl file with biopython and extract information

from Bio import SeqIO

input = "test.embl"  #change your input, here

#next if you had one sequence in the input file
seq = SeqIO.parse(open(input), "embl").next()

UTR5 = [feature for feature in seq.features if feature.type=="5'UTR"]

#you have only one 5'utr
genes = UTR5[0].qualifiers['gene']
#you get ['Ngp']

#Create SeqRecord
from Bio.SeqRecord import SeqRecord

#you may remove description, if not required
new_record = SeqRecord(seq.seq, id= "_".join(genes), 
             name=seq.name, description=seq.description)

print new_record.format("fasta")

you get:

>Ngp 5'UTR in Mus musculus neutrophilic granule protein (Ngp), mRNA.
AGTCTCAATATCATCTACATAAAAGGGGCCAAGAGTGGTAGTGTGTCAGAGACA

Upvotes: 0

Vince
Vince

Reputation: 3395

This is EMBL format:

ftp://ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt

You can extract information from this using BioPython SeqIO:

http://biopython.org/wiki/SeqIO

Upvotes: 1

Related Questions