Reputation: 11
I am trying to get blastP results for a specific protein using NCBIWWW. The problem is that what gets sent back is not what I recognize as alignment data, what I do get is this (This is the contents of 'Blast_record' in the source code); I am using code obtained from the 'BioPython tutorial and cookbook', I have scoured it and the internet for the source of error but I just can't find it. My source code is this;
# biopython
from Bio import SeqIO
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
# first get the sequence we want to parse from a FASTA file
# f_record = next(SeqIO.parse('m_cold.fasta', 'fasta'))
print('Doing the BLAST and retrieving the results...')
result_handle = NCBIWWW.qblast('blastp', 'tsa', '365176198')
# save the results for later, in case we want to look at it
save_file = open('m_cold_blast.out', 'w')
blast_results = result_handle.read()
save_file.write(blast_results)
save_file.close()
The resulting file is:
<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
<BlastOutput_program>blastp</BlastOutput_program>
<BlastOutput_version>BLASTP 2.2.31+</BlastOutput_version>
<BlastOutput_reference>Stephen F. Altschul, Thomas L. Madden, Alejandro A. Sch&auml;ffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference>
<BlastOutput_db>tsa</BlastOutput_db>
<BlastOutput_query-ID>gi|365176198|gb|AEW67975.1|</BlastOutput_query-ID>
<BlastOutput_query-def>polyprotein [Black queen cell virus] </BlastOutput_query-def>
<BlastOutput_query-len>171</BlastOutput_query-len>
<BlastOutput_param>
<Parameters>
<Parameters_matrix>BLOSUM62</Parameters_matrix>
<Parameters_expect>10</Parameters_expect>
<Parameters_gap-open>11</Parameters_gap-open>
<Parameters_gap-extend>1</Parameters_gap-extend>
<Parameters_filter>F</Parameters_filter>
</Parameters>
</BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_query-ID>gi|365176198|gb|AEW67975.1|</Iteration_query-ID>
<Iteration_query-def>polyprotein [Black queen cell virus]</Iteration_query-def>
<Iteration_query-len>171</Iteration_query-len>
<Iteration_hits>
</Iteration_hits>
<Iteration_stat>
<Statistics>
<Statistics_db-num>0</Statistics_db-num>
<Statistics_db-len>0</Statistics_db-len>
<Statistics_hsp-len>0</Statistics_hsp-len>
<Statistics_eff-space>0</Statistics_eff-space>
<Statistics_kappa>-1</Statistics_kappa>
<Statistics_lambda>-1</Statistics_lambda>
<Statistics_entropy>-1</Statistics_entropy>
</Statistics>
</Iteration_stat>
</Iteration>
</BlastOutput_iterations>
</BlastOutput>
Now, if I perform a search using BlastN and the nucleotide sequence of the protein used above I get all the matched sequences, their E values and Scores etc. So why is this not the case when using BlastP?
I am very new to both Python and Biopython and for the life of me I cannot figure out what I am doing wrong.
Upvotes: 1
Views: 1308
Reputation: 98
I noticed your output file is ends in '.out'. Try saving it to an XML file, and it will map them to nice columns. In the first line of your output file, you'll see '?xml. The qBLAST function is uses XML by default, and has an optional argument for "Text" as well, though the formatting is a nightmare.
It is also true that blastp and tsa are different databases. The qblast module has some built in help that might help with different arguments, which can be accessed with this.
>>> from Bio.Blast import NCBIWWW
>>> help(NCBIWWW.qblast)
(I would run your code but qBLAST is having issues right at the time of this writing)
Upvotes: 0
Reputation: 8164
QUERY 365176198 is a protein
DATABASE are nucleotics
What is the Transcriptome Shotgun Assembly (TSA) Database?
TSA is an archive of computationally assembled sequences from primary data such as ESTs, traces and Next Generation Sequencing Technologies. The overlapping sequence reads from a complete transcriptome are assembled into transcripts by computational methods instead of by traditional cloning and sequencing of cloned cDNAs.
Are TSA sequences available by a BLAST search?
A Transcriptome Shotgun Assembly (TSA) BLAST database is now available. The sequences were initially included in nt but now have been segregated into a separate database. The TSA database is available from the BLAST home page under Basic BLAST at the nucleotide, tblastn, and tblastx links. These sequences are not available in nt.
BLAST FLAVORS
blastp: compares an amino acid query sequence against a protein sequence database blastn: compares a nucleotide query sequence against a nucleotide sequence database blastx: compares a nucleotide query sequence translated in all reading frames against a protein sequence database tblastn: compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames tblastx: compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that tblastx program cannot be used with the nr database on the BLAST Web page.
MANUALLY ANSWER
BIOPYTHON ANSWER
you must use "tsa_nt" instead of "tsa", and "tblastn" instead of "blastp"
query = '365176198'
#note: this may take several minutes
result_handle = NCBIWWW.qblast('tblastn', 'tsa_nt', query, format_type="Text")
you get:
....... Score E Sequences producing significant alignments: (Bits) Value gb|GAZV01037943.1| Apis mellifera comp13466_c0_seq1 transcrib... 342 4e-105 gb|GAZF01116856.1| Essigella californica C629542 transcribed ... 179 1e-54 gb|GBYB01008381.1| Fopius arisanus c20283_g1_i1 transcribed R... 149 2e-37 gb|GAUO01000423.1| Velia caprai s423_L_1942_0 transcribed RNA... 58.9 3e-07 gb|GAXG01028220.1| Gynaikothrips ficorum s28263_L_292921_0 tr... 57.0 9e-07 gb|GAWP01023404.1| Grylloblatta bifratrilecta s23438_L_295244... 52.8 4e-05 gb|GAWZ01143177.1| Gryllotalpa sp. AD-2013 C589197 transcribe... 45.4 0.002 gb|GAXW01013938.1| Euroleon nostras s13984_L_116369_0 transcr... 45.8 0.006 gb|GAXC01050700.1| Thrips palmi C235436 transcribed RNA sequence 42.0 0.017 gb|GAXH01037906.1| Parides eurimedes C235744 transcribed RNA ... 40.4 0.069 gb|GBES01007135.1| Dichelops melacanthus Locus_17334_Transcri... 39.7 0.18 gb|GBXI01014067.1| Bactrocera cucurbitae c16593_g1_i1 transcr... 40.4 0.49 gb|GAMC01001920.1| Ceratitis capitata comp55379_c0_seq1 mRNA ... 40.4 0.49 gb|GARL01030594.1| Spodoptera exigua SEUC25635_TC01 transcrib... 39.7 0.88 gb|GAZS01034153.1| Acanthoscurria geniculata L2169_T1/2_Turan... 38.5 2.1 gb|GAZS01034154.1| Acanthoscurria geniculata L2170_T2/2_Turan... 38.5 2.1 gb|GAYD01030921.1| Blaberus atropos s30958_L_499964_0 transcr... 38.5 2.2 gb|GAZR01021123.1| Stegodyphus mimosarum L19863_T1/1_Velvet_W... 36.2 3.0 gb|GBCX01022664.1| Dastarcus helophoroides Unigene14575 trans... 37.7 3.8 gb|GAYF01148415.1| Nilaparvata lugens C730037 transcribed RNA... 37.7 4.5 gb|GAMK01054259.1| Phaseolus vulgaris Ref_259_comp8866_c0_seq... 37.0 8.5 gb|EZ343106.1| Artemisia annua strain Uganda Contig10322.Uhm ... 35.8 8.9 ALIGNMENTS >gb|GAZV01037943.1| TSA: Apis mellifera comp13466_c0_seq1 transcribed RNA sequence Length=6998 Score = 342 bits (878), Expect = 4e-105, Method: Compositional matrix adjust. Identities = 164/168 (98%), Positives = 167/168 (99%), Gaps = 0/168 (0%) Frame = +2 Query 4 YALYRGGVRVKVVTGRGVDFVRATVSPQQTYGSEVAPTTHISTPLAIEQIPIKGVAEFQI 63 YALYRGGVRVKVVT +GVDFVRATVSPQQTYGS+VAPTTHISTPLAIEQIPIKGVAEFQI Sbjct 6317 YALYRGGVRVKVVTEKGVDFVRATVSPQQTYGSDVAPTTHISTPLAIEQIPIKGVAEFQI 6496 Query 64 PYYAPCLSSSFRANSETFYYSSGRNNLDIATSPPSINRYYAVGAGDDMDFSIFIGTPPCI 123 PYYAPCLSSSFRANSETFYYSSGRNNLDI+TSPPSINRYYAVGAGDDMDFSIFIGTPPCI Sbjct 6497 PYYAPCLSSSFRANSETFYYSSGRNNLDISTSPPSINRYYAVGAGDDMDFSIFIGTPPCI 6676 .......
Upvotes: 1