Reputation: 31
I just installed Biopython and wanted to try out its features and so I started to go through the tutorial.
However, when I reached the chapter about obtaining information from Entrez, I encountered a problem.
The example in the tutorial is simple:
from Bio import Entrez
Entrez.email = "[email protected]"
handle = Entrez.einfo(db="pubmed")
record = Entrez.read(handle)
This works fine. But as soon as I want to parse a different database than pubmed, I get the following error:
Bio.Entrez.Parser.ValidationError: Failed to find tag 'Build' in the DTD. To skip all tags that are not represented in the DTD, please call Bio.Entrez.read or Bio.Entrez.parse with validate=False.
Trying the validate=False option also doesn't work, because it raises a Bio.Entrez.Parser.NotXMLError.
Can someone tell me what I did wrong and how I can solve this issue?
Upvotes: 3
Views: 1018
Reputation: 1000
To get around this problem, simply alter your call to Entrez.read()
to include the validate parameter, like so:
record = Entrez.read(handle, validate=False)
The other answer to this question is right: it's a fault in the Biopython parser. Hopefully they'll update it soon.
Upvotes: 3
Reputation: 2040
THIS IS NOT REALLY A VALID SOLUTION, BUT IT SHOWS WHAT THE PROBLEM IS. I think it's probably a Biopython (Bio.Entrez.Parser) bug, so I'll get in contact with them and see what they think.
A bit of hacking at Biopython shows that the problem is caused by a 'Build' tag.
If we fetch the XML manually, the first few lines of the pubmed
response look like this:
<eInfoResult>
<DbInfo>
<DbName>pubmed</DbName>
<MenuName>PubMed</MenuName>
<Description>PubMed bibliographic record</Description>
<Count>22224084</Count>
<LastUpdate>2012/10/30 03:30</LastUpdate>
....
But the protein response looks like this:
<eInfoResult>
<DbInfo>
<DbName>protein</DbName>
<MenuName>Protein</MenuName>
<Description>Protein sequence record</Description>
<Build>Build121030-0741m.1</Build> <-------- THIS IS BAD
<Count>59244879</Count>
<LastUpdate>2012/10/30 18:39</LastUpdate>
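The difference between the two responses can also be found mechanically. The following is only a minimal sketch using the standard library: the two strings are trimmed copies of the snippets above with closing tags added so that they parse, and tag_names is a hypothetical helper, not part of Biopython.

```python
import xml.etree.ElementTree as ET

# Trimmed copies of the two eInfo responses quoted above.
# The closing </DbInfo></eInfoResult> tags were added so the snippets parse.
PUBMED_XML = """<eInfoResult><DbInfo>
<DbName>pubmed</DbName>
<MenuName>PubMed</MenuName>
<Description>PubMed bibliographic record</Description>
<Count>22224084</Count>
<LastUpdate>2012/10/30 03:30</LastUpdate>
</DbInfo></eInfoResult>"""

PROTEIN_XML = """<eInfoResult><DbInfo>
<DbName>protein</DbName>
<MenuName>Protein</MenuName>
<Description>Protein sequence record</Description>
<Build>Build121030-0741m.1</Build>
<Count>59244879</Count>
<LastUpdate>2012/10/30 18:39</LastUpdate>
</DbInfo></eInfoResult>"""


def tag_names(xml_text):
    """Return the set of element names that occur in an XML document."""
    return {element.tag for element in ET.fromstring(xml_text).iter()}


# Tags that appear in the protein response but not in the pubmed one.
extra_tags = tag_names(PROTEIN_XML) - tag_names(PUBMED_XML)
print(sorted(extra_tags))  # ['Build']
```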
I had a look at how Entrez.Parser works, and it basically doesn't recognize the Build
tag. Further digging shows that the tags are defined in DTD files; the eInfo DTD file on my system is in this directory:
/usr/local/lib/python2.7/dist-packages/Bio/Entrez/DTDs
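The path above is specific to my machine. As a rough sketch, assuming the DTDs directory sits next to the Bio.Entrez package as it does here, something like this should locate it on other installs. find_entrez_dtd_dir is a hypothetical helper, and it returns None if Biopython or the directory cannot be found.

```python
import importlib.util
import os


def find_entrez_dtd_dir():
    """Return the DTDs directory of the installed Bio.Entrez package, or None."""
    try:
        spec = importlib.util.find_spec("Bio.Entrez")
    except ImportError:
        return None  # Biopython is not installed
    if spec is None or spec.origin is None:
        return None
    # Assumption: the DTD files live in a "DTDs" folder beside Bio/Entrez/__init__.py.
    dtd_dir = os.path.join(os.path.dirname(spec.origin), "DTDs")
    return dtd_dir if os.path.isdir(dtd_dir) else None


dtd_dir = find_entrez_dtd_dir()
print(dtd_dir)
```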
If we examine the relevant file, eInfo_020511.dtd,
and add a Build tag line (the line below with the arrow wasn't there before):
<!--
This is the Current DTD for Entrez eInfo
$Id: eInfo_020511.dtd,v 1.1 2008-05-13 11:17:44 mdehoon Exp $
-->
<!-- ================================================================= -->
<!ELEMENT DbName (#PCDATA)> <!-- \S+ -->
<!ELEMENT Name (#PCDATA)> <!-- .+ -->
<!ELEMENT FullName (#PCDATA)> <!-- .+ -->
<!ELEMENT Description (#PCDATA)> <!-- .+ -->
<!ELEMENT Build (#PCDATA)> <!-- .+ --> <------- I ADDED THIS LINE
<!ELEMENT TermCount (#PCDATA)> <!-- \d+ -->
<!ELEMENT Menu (#PCDATA)> <!-- .+ -->
It now works. The comments in this file suggest it hasn't been updated since 2008 (the line below comes from the DTD header).
$Id: eInfo_020511.dtd,v 1.1 2008-05-13 11:17:44 mdehoon Exp $
My guess is that the build tag has been added since then but this file was never updated to reflect that.
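To illustrate the mismatch without touching the installed files, here is a small standard-library sketch. It is not how Biopython's parser actually validates; it only compares tag names in a response fragment against the ELEMENT declarations from the DTD excerpt above. declared_elements and undeclared_tags are hypothetical helpers, and only the children of the root are checked because the excerpt omits the container declarations.

```python
import re
import xml.etree.ElementTree as ET

# ELEMENT declarations copied from the DTD excerpt above, without the added Build line.
DTD_EXCERPT = """
<!ELEMENT DbName (#PCDATA)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT FullName (#PCDATA)>
<!ELEMENT Description (#PCDATA)>
<!ELEMENT TermCount (#PCDATA)>
<!ELEMENT Menu (#PCDATA)>
"""

# A fragment of the protein response, limited to tags covered by the excerpt plus Build.
RESPONSE_FRAGMENT = """<DbInfo>
<DbName>protein</DbName>
<Description>Protein sequence record</Description>
<Build>Build121030-0741m.1</Build>
</DbInfo>"""


def declared_elements(dtd_text):
    """Collect the element names declared with <!ELEMENT ...> in a DTD."""
    return set(re.findall(r"<!ELEMENT\s+([\w.:-]+)", dtd_text))


def undeclared_tags(xml_text, dtd_text):
    """List child tags of the root element that the DTD text does not declare."""
    declared = declared_elements(dtd_text)
    root = ET.fromstring(xml_text)
    return sorted({child.tag for child in root} - declared)


before = undeclared_tags(RESPONSE_FRAGMENT, DTD_EXCERPT)
after = undeclared_tags(RESPONSE_FRAGMENT, DTD_EXCERPT + "<!ELEMENT Build (#PCDATA)>\n")
print(before)  # ['Build']
print(after)   # []
```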
Upvotes: 2