Reputation: 31
I am trying to retrieve particular RNA UTR sequences from a database. From the data base I got a .dat file in which each RNA UTR is represented by an entry like this:
ID 5MMUR018955; SV 1; linear; mRNA; STD; MUS; 54 BP. XX AC BR058092; XX DT 01-JUL-2009 (Rel. 9, Created) DT 01-JUL-2009 (Rel. 9, Last updated, Version 1) XX DE 5'UTR in Mus musculus neutrophilic granule protein (Ngp), mRNA. XX DR REFSEQ; NM_008694; DR UTRef; CR062409; DR GeneID; 18054; XX OS Mus musculus (house mouse) OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; OC Eutheria; Euarchontoglires; Glires; Rodentia; Sciurognathi; Muroidea; OC Muridae; Murinae; Mus; Mus. XX UT 5'UTR; 1 exon(s) XX FH Key Location/Qualifiers FH FT source 1..54 FT /organism="Mus musculus" FT /mol_type="mRNA" FT /strain="C57BL/6" FT /db_xref="taxon:10090" FT 5'UTR 1..54 FT /source="REFSEQ::NM_008694:1..54" FT /gene="Ngp" FT /product="neutrophilic granule protein" FT /gene_synonym="bectenecin" FT /genome="chr9:110322312-110322365:+" XX SQ Sequence 54 BP; 19 A; 9 C; 14 G; 12 T; 0 other; agtctcaata tcatctacat aaaaggggcc aagagtggta gtgtgtcaga gaca 54 //
I have a list of gene names (stored in the line FT /gene="Ngp"
), which I want to use to access the sequence stored in the line SQ Sequence 54 BP; 19 A; 9 C; 14 G; 12 T; 0 other;
agtctcaata tcatctacat aaaaggggcc aagagtggta gtgtgtcaga gaca 54
Following retrieval, I would like to write both into a new file in fasta format, i.e.
>Ngp
agtctcaatatcatctacataaaaggggccaagagtggtagtgtgtcagagaca"
Is there any simple way to do this in python? I have been fighting with it all day and haven't really gotten anywhere and would really appreciate your help.
Upvotes: 1
Views: 758
Reputation: 318
Here's a robust solution that searches the database file for a list of genes, prints the results in fasta
format and lists the genes that are not found.
Note that there may be multiple records of sequences for the same gene name in the database so you might need additional filtering to get exactly those sequences that you wish to obtain.
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
data = "embl.dat" #Path to EMBL database file
search = "gene_names.txt" #Path to file with search terms
#Load the search terms from file and strip linefeed characters
search_genes = open(search, 'r').read().splitlines()
found_genes = []
#Search the EMBL database file
for record in SeqIO.parse(open(data, 'r'), 'embl'):
UTR5 = [feature for feature in record.features if feature.type=="5'UTR"]
for utr5feature in UTR5:
for s in search_genes:
genes = utr5feature.qualifiers['gene']
if s in genes:
found_genes.append(s)
#Gene found. Print a modified copy of the record in the desired format
print SeqRecord(record.seq, id="_".join(genes), name=record.name,
description=record.description).format('fasta')
#List any search terms that were not found in the database
for s in search_genes:
if s not in found_genes:
print s+" NOT FOUND IN DATABASE!"
Upvotes: 1
Reputation: 8164
i parse the embl file with biopython and extract information
from Bio import SeqIO
input = "test.embl" #change your input, here
#next if you had one sequence in the input file
seq = SeqIO.parse(open(input), "embl").next()
UTR5 = [feature for feature in seq.features if feature.type=="5'UTR"]
#you have only one 5'utr
genes = UTR5[0].qualifiers['gene']
#you get ['Ngp']
#Create SeqRecord
from Bio.SeqRecord import SeqRecord
#you may remove description, if not required
new_record = SeqRecord(seq.seq, id= "_".join(genes),
name=seq.name, description=seq.description)
print new_record.format("fasta")
you get:
>Ngp 5'UTR in Mus musculus neutrophilic granule protein (Ngp), mRNA. AGTCTCAATATCATCTACATAAAAGGGGCCAAGAGTGGTAGTGTGTCAGAGACA
Upvotes: 0
Reputation: 3395
This is EMBL format:
ftp://ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt
You can extract information from this using BioPython SeqIO:
http://biopython.org/wiki/SeqIO
Upvotes: 1