user1752645

Reputation: 31

Bio.Entrez.Parser.ValidationError: Failed to find tag 'Build' in the DTD

I just installed Biopython and wanted to try out its features and so I started to go through the tutorial.

However, when I reached the chapter about obtaining information from Entrez, I encountered a problem.

The example in the tutorial is simple:

from Bio import Entrez
Entrez.email = "[email protected]"
handle = Entrez.einfo(db="pubmed")
record = Entrez.read(handle)

This works fine. But as soon as I query a database other than pubmed, I get the following error:

Bio.Entrez.Parser.ValidationError: Failed to find tag 'Build' in the DTD. To skip all tags that are not represented in the DTD, please call Bio.Entrez.read or Bio.Entrez.parse with validate=False.

Passing validate=False doesn't help either, because that raises a Bio.Entrez.Parser.NotXMLError.

Can someone tell me what I did wrong and how I can solve this issue?

Upvotes: 3

Views: 1018

Answers (2)

jhrf

Reputation: 1000

To get around this problem, change your call to Entrez.read() to include the validate parameter, like so:

record = Entrez.read(handle,validate=False)

The other answer to this question is right. It's a fault in the Biopython parser. Hopefully they'll update it soon.

Upvotes: 3

Alex

Reputation: 2040

THIS IS NOT REALLY A VALID SOLUTION, BUT SHOWS WHAT THE PROBLEM IS. I think it's probably a Biopython (Entrez.Parser) bug, so I'll get in contact with them and see what they think.

A bit of hacking at Biopython shows that the problem is caused by a 'Build' tag.

If we fetch the XML manually, the first few lines of the pubmed response look like this:

<eInfoResult>
  <DbInfo>
    <DbName>pubmed</DbName>
    <MenuName>PubMed</MenuName>
    <Description>PubMed bibliographic record</Description>
    <Count>22224084</Count>
    <LastUpdate>2012/10/30 03:30</LastUpdate>
    ....

But the protein response looks like this:

<eInfoResult>
  <DbInfo>
    <DbName>protein</DbName>
    <MenuName>Protein</MenuName>
    <Description>Protein sequence record</Description>
    <Build>Build121030-0741m.1</Build>                   <-------- THIS IS BAD
    <Count>59244879</Count>
    <LastUpdate>2012/10/30 18:39</LastUpdate>
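
The difference between the two responses can also be found mechanically. Below is a minimal sketch using only the standard library's xml.etree.ElementTree. The two strings are abbreviated copies of the excerpts above, with closing tags added by hand so that they parse, and child_tags and extra_tags are made-up helper names, not part of Biopython.

```python
import xml.etree.ElementTree as ET

# Abbreviated copies of the two eInfo responses quoted above.
# The closing tags were added by hand so the snippets are well-formed.
PUBMED_XML = """<eInfoResult>
  <DbInfo>
    <DbName>pubmed</DbName>
    <MenuName>PubMed</MenuName>
    <Description>PubMed bibliographic record</Description>
    <Count>22224084</Count>
    <LastUpdate>2012/10/30 03:30</LastUpdate>
  </DbInfo>
</eInfoResult>"""

PROTEIN_XML = """<eInfoResult>
  <DbInfo>
    <DbName>protein</DbName>
    <MenuName>Protein</MenuName>
    <Description>Protein sequence record</Description>
    <Build>Build121030-0741m.1</Build>
    <Count>59244879</Count>
    <LastUpdate>2012/10/30 18:39</LastUpdate>
  </DbInfo>
</eInfoResult>"""


def child_tags(xml_text):
    """Return the tag names of the direct children of <DbInfo>, in order."""
    db_info = ET.fromstring(xml_text).find("DbInfo")
    return [child.tag for child in db_info]


def extra_tags(reference_xml, other_xml):
    """Return tags that appear in other_xml but not in reference_xml."""
    known = set(child_tags(reference_xml))
    return [tag for tag in child_tags(other_xml) if tag not in known]


print(extra_tags(PUBMED_XML, PROTEIN_XML))  # ['Build']
```

The only tag that protein has and pubmed lacks is Build, which matches the tag named in the error message.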

I had a look at how Entrez.Parser works, and it basically doesn't recognize the Build tag. Further digging shows that the tags are defined in DTD files. On my system, the einfo DTD file lives in this directory:

/usr/local/lib/python2.7/dist-packages/Bio/Entrez/DTDs
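
The directory above is specific to one installation. A rough way to work out the equivalent path elsewhere is to ask Python where the package lives. This sketch assumes the DTDs folder sits next to the package's __init__.py, as the path above suggests; bundled_dtd_dir is a made-up helper name, and the function only builds the path without checking that it exists.

```python
import importlib.util
import os


def bundled_dtd_dir(package="Bio.Entrez", subdir="DTDs"):
    """Build the expected path of `subdir` inside an installed package.

    Returns None if the package cannot be found. The layout (a DTDs folder
    inside the package directory) is an assumption taken from the path
    quoted above, not something this function verifies.
    """
    try:
        spec = importlib.util.find_spec(package)
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. "Bio") is not installed.
        spec = None
    if spec is None or spec.origin is None:
        return None
    return os.path.join(os.path.dirname(spec.origin), subdir)


# Prints the expected DTD directory, or None if Biopython is not installed.
print(bundled_dtd_dir())
```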

Open the relevant file, eInfo_020511.dtd, and add a declaration for the Build element (the line marked with the arrow below wasn't there before):

<!--    
                This is the Current DTD for Entrez eInfo
$Id: eInfo_020511.dtd,v 1.1 2008-05-13 11:17:44 mdehoon Exp $
-->
<!-- ================================================================= -->

<!ELEMENT   DbName      (#PCDATA)>  <!-- \S+ -->
<!ELEMENT   Name        (#PCDATA)>  <!-- .+ -->
<!ELEMENT   FullName    (#PCDATA)>  <!-- .+ -->
<!ELEMENT   Description (#PCDATA)>  <!-- .+ -->
<!ELEMENT   Build       (#PCDATA)>  <!-- .+ -->     <------- I ADDED THIS LINE
<!ELEMENT   TermCount   (#PCDATA)>  <!-- \d+ -->
<!ELEMENT   Menu        (#PCDATA)>  <!-- .+ -->

It now works. The comments in this file suggest it hasn't been updated since 2008 (the line below comes from the DTD header).

 $Id: eInfo_020511.dtd,v 1.1 2008-05-13 11:17:44 mdehoon Exp $

My guess is that the Build tag was added to the eInfo output after that date, but this file was never updated to reflect it.
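
To illustrate the kind of check that fails, here is a simplified sketch. It is not Biopython's actual parser code: it just collects the element names declared in a DTD with a regular expression and reports which tags in a response are missing from that set. The DTD text is the excerpt shown above without the added line, the XML is a cut-down protein response limited to tags that appear in that excerpt, and declared_elements and undeclared_tags are made-up helper names.

```python
import re
import xml.etree.ElementTree as ET

# The DTD excerpt from above, as it was before the Build line was added.
DTD_EXCERPT = r"""
<!ELEMENT   DbName      (#PCDATA)>  <!-- \S+ -->
<!ELEMENT   Name        (#PCDATA)>  <!-- .+ -->
<!ELEMENT   FullName    (#PCDATA)>  <!-- .+ -->
<!ELEMENT   Description (#PCDATA)>  <!-- .+ -->
<!ELEMENT   TermCount   (#PCDATA)>  <!-- \d+ -->
<!ELEMENT   Menu        (#PCDATA)>  <!-- .+ -->
"""

# Cut-down protein response: only tags covered by the excerpt, plus Build.
SAMPLE_XML = """<DbInfo>
  <DbName>protein</DbName>
  <Description>Protein sequence record</Description>
  <Build>Build121030-0741m.1</Build>
</DbInfo>"""

ELEMENT_DECL = re.compile(r"<!ELEMENT\s+([^\s>]+)")


def declared_elements(dtd_text):
    """Return the set of element names declared with <!ELEMENT ...>."""
    return set(ELEMENT_DECL.findall(dtd_text))


def undeclared_tags(xml_text, dtd_text):
    """Return child tags of the root element that the DTD does not declare."""
    declared = declared_elements(dtd_text)
    root = ET.fromstring(xml_text)
    return [child.tag for child in root if child.tag not in declared]


print(undeclared_tags(SAMPLE_XML, DTD_EXCERPT))  # ['Build']

# Adding the declaration, as in the edited DTD above, clears the report.
patched = DTD_EXCERPT + "<!ELEMENT   Build       (#PCDATA)>\n"
print(undeclared_tags(SAMPLE_XML, patched))  # []
```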

Upvotes: 2
