Jonathan
Jonathan

Reputation: 11

Output from BlastP using NCBIWWW is not what I expected

I am trying to get blastP results for a specific protein using NCBIWWW. The problem is that what gets sent back is not what I recognize as alignment data, what I do get is this (This is the contents of 'Blast_record' in the source code); I am using code obtained from the 'BioPython tutorial and cookbook', I have scoured it and the internet for the source of error but I just can't find it. My source code is this;

# biopython
from Bio import SeqIO
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML

# first get the sequence we want to parse from a FASTA file
# f_record = next(SeqIO.parse('m_cold.fasta', 'fasta'))

print('Doing the BLAST and retrieving the results...')
result_handle = NCBIWWW.qblast('blastp', 'tsa', '365176198')

# save the results for later, in case we want to look at it
save_file = open('m_cold_blast.out', 'w')
blast_results = result_handle.read()
save_file.write(blast_results)
save_file.close()

The resulting file is:

<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
<BlastOutput_program>blastp</BlastOutput_program>
<BlastOutput_version>BLASTP 2.2.31+</BlastOutput_version>
<BlastOutput_reference>Stephen F. Altschul, Thomas L. Madden, Alejandro A. Sch&amp;auml;ffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), &quot;Gapped BLAST and PSI-BLAST: a new generation of protein database search programs&quot;, Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference>
<BlastOutput_db>tsa</BlastOutput_db>
<BlastOutput_query-ID>gi|365176198|gb|AEW67975.1|</BlastOutput_query-ID>
<BlastOutput_query-def>polyprotein [Black queen cell virus]    </BlastOutput_query-def>
<BlastOutput_query-len>171</BlastOutput_query-len>
<BlastOutput_param>
<Parameters>
  <Parameters_matrix>BLOSUM62</Parameters_matrix>
  <Parameters_expect>10</Parameters_expect>
  <Parameters_gap-open>11</Parameters_gap-open>
  <Parameters_gap-extend>1</Parameters_gap-extend>
  <Parameters_filter>F</Parameters_filter>
</Parameters>
</BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_query-ID>gi|365176198|gb|AEW67975.1|</Iteration_query-ID>
<Iteration_query-def>polyprotein [Black queen cell virus]</Iteration_query-def>
<Iteration_query-len>171</Iteration_query-len>
<Iteration_hits>
</Iteration_hits>
<Iteration_stat>
<Statistics>
  <Statistics_db-num>0</Statistics_db-num>
  <Statistics_db-len>0</Statistics_db-len>
  <Statistics_hsp-len>0</Statistics_hsp-len>
  <Statistics_eff-space>0</Statistics_eff-space>
  <Statistics_kappa>-1</Statistics_kappa>
  <Statistics_lambda>-1</Statistics_lambda>
  <Statistics_entropy>-1</Statistics_entropy>
</Statistics>
</Iteration_stat>
</Iteration>
</BlastOutput_iterations>
</BlastOutput>

Now, if I perform a search using BlastN and the nucleotide sequence of the protein used above I get all the matched sequences, their E values and Scores etc. So why is this not the case when using BlastP?

I am very new to both Python and Biopython and for the life of me I cannot figure out what I am doing wrong.

Upvotes: 1

Views: 1308

Answers (2)

Sweasonable Doubt
Sweasonable Doubt

Reputation: 98

I noticed your output file is ends in '.out'. Try saving it to an XML file, and it will map them to nice columns. In the first line of your output file, you'll see '?xml. The qBLAST function is uses XML by default, and has an optional argument for "Text" as well, though the formatting is a nightmare.

It is also true that blastp and tsa are different databases. The qblast module has some built in help that might help with different arguments, which can be accessed with this.

>>> from Bio.Blast import NCBIWWW
>>> help(NCBIWWW.qblast)

(I would run your code but qBLAST is having issues right at the time of this writing)

Upvotes: 0

Jose Ricardo Bustos M.
Jose Ricardo Bustos M.

Reputation: 8164

QUERY 365176198 is a protein

DATABASE are nucleotics

What is the Transcriptome Shotgun Assembly (TSA) Database?

TSA is an archive of computationally assembled sequences from primary data such as ESTs, traces and Next Generation Sequencing Technologies. The overlapping sequence reads from a complete transcriptome are assembled into transcripts by computational methods instead of by traditional cloning and sequencing of cloned cDNAs.

Are TSA sequences available by a BLAST search?

A Transcriptome Shotgun Assembly (TSA) BLAST database is now available. The sequences were initially included in nt but now have been segregated into a separate database. The TSA database is available from the BLAST home page under Basic BLAST at the nucleotide, tblastn, and tblastx links. These sequences are not available in nt.

BLAST FLAVORS

blastp: 
     compares an amino acid query sequence against a protein sequence
     database

blastn: compares a nucleotide query sequence against a nucleotide 
     sequence database

blastx: compares a nucleotide query sequence translated in all 
     reading frames against a protein sequence database

tblastn: compares a protein query sequence against a nucleotide 
     sequence database dynamically translated in all reading frames

tblastx: compares the six-frame translations of a nucleotide query
     sequence against the six-frame translations of a nucleotide 
     sequence database. Please note that tblastx program cannot be 
     used with the nr database on the BLAST Web page.

MANUALLY ANSWER

input blast manually

output blast manually

BIOPYTHON ANSWER

you must use "tsa_nt" instead of "tsa", and "tblastn" instead of "blastp"

query = '365176198'
#note: this may take several minutes
result_handle = NCBIWWW.qblast('tblastn', 'tsa_nt', query, format_type="Text")

you get:

.......

                                                                   Score     E
Sequences producing significant alignments:                       (Bits)  Value

gb|GAZV01037943.1|  Apis mellifera comp13466_c0_seq1 transcrib...  342     4e-105
gb|GAZF01116856.1|  Essigella californica C629542 transcribed ...  179     1e-54 
gb|GBYB01008381.1|  Fopius arisanus c20283_g1_i1 transcribed R...  149     2e-37 
gb|GAUO01000423.1|  Velia caprai s423_L_1942_0 transcribed RNA...  58.9    3e-07 
gb|GAXG01028220.1|  Gynaikothrips ficorum s28263_L_292921_0 tr...  57.0    9e-07 
gb|GAWP01023404.1|  Grylloblatta bifratrilecta s23438_L_295244...  52.8    4e-05 
gb|GAWZ01143177.1|  Gryllotalpa sp. AD-2013 C589197 transcribe...  45.4    0.002 
gb|GAXW01013938.1|  Euroleon nostras s13984_L_116369_0 transcr...  45.8    0.006 
gb|GAXC01050700.1|  Thrips palmi C235436 transcribed RNA sequence  42.0    0.017 
gb|GAXH01037906.1|  Parides eurimedes C235744 transcribed RNA ...  40.4    0.069 
gb|GBES01007135.1|  Dichelops melacanthus Locus_17334_Transcri...  39.7    0.18  
gb|GBXI01014067.1|  Bactrocera cucurbitae c16593_g1_i1 transcr...  40.4    0.49  
gb|GAMC01001920.1|  Ceratitis capitata comp55379_c0_seq1 mRNA ...  40.4    0.49  
gb|GARL01030594.1|  Spodoptera exigua SEUC25635_TC01 transcrib...  39.7    0.88  
gb|GAZS01034153.1|  Acanthoscurria geniculata L2169_T1/2_Turan...  38.5    2.1   
gb|GAZS01034154.1|  Acanthoscurria geniculata L2170_T2/2_Turan...  38.5    2.1   
gb|GAYD01030921.1|  Blaberus atropos s30958_L_499964_0 transcr...  38.5    2.2   
gb|GAZR01021123.1|  Stegodyphus mimosarum L19863_T1/1_Velvet_W...  36.2    3.0   
gb|GBCX01022664.1|  Dastarcus helophoroides Unigene14575 trans...  37.7    3.8   
gb|GAYF01148415.1|  Nilaparvata lugens C730037 transcribed RNA...  37.7    4.5   
gb|GAMK01054259.1|  Phaseolus vulgaris Ref_259_comp8866_c0_seq...  37.0    8.5   
gb|EZ343106.1|  Artemisia annua strain Uganda Contig10322.Uhm ...  35.8    8.9   

ALIGNMENTS
>gb|GAZV01037943.1| TSA: Apis mellifera comp13466_c0_seq1 transcribed RNA sequence
Length=6998

 Score = 342 bits (878),  Expect = 4e-105, Method: Compositional matrix adjust.
 Identities = 164/168 (98%), Positives = 167/168 (99%), Gaps = 0/168 (0%)
 Frame = +2

Query  4     YALYRGGVRVKVVTGRGVDFVRATVSPQQTYGSEVAPTTHISTPLAIEQIPIKGVAEFQI  63
             YALYRGGVRVKVVT +GVDFVRATVSPQQTYGS+VAPTTHISTPLAIEQIPIKGVAEFQI
Sbjct  6317  YALYRGGVRVKVVTEKGVDFVRATVSPQQTYGSDVAPTTHISTPLAIEQIPIKGVAEFQI  6496

Query  64    PYYAPCLSSSFRANSETFYYSSGRNNLDIATSPPSINRYYAVGAGDDMDFSIFIGTPPCI  123
             PYYAPCLSSSFRANSETFYYSSGRNNLDI+TSPPSINRYYAVGAGDDMDFSIFIGTPPCI
Sbjct  6497  PYYAPCLSSSFRANSETFYYSSGRNNLDISTSPPSINRYYAVGAGDDMDFSIFIGTPPCI  6676

.......

Upvotes: 1

Related Questions