Eduard Florinescu

Reputation: 17521

Python: Which XML parser supports DTD !ENTITY definitions?

I have the XML file below. I am currently using minidom, and for this example the documentElement's tagName comes back as xyz:widget, which suggests to me that it ignores the !ENTITY definitions and thus the !DOCTYPE declaration.

Which XML parser supports Document Type Definitions, so that the !ENTITY definitions and the !DOCTYPE declaration are not ignored?

<?xml version="1.0" standalone="yes" ?>
<!DOCTYPE widget [
<!ENTITY widgets-ns "http://www.w3.org/ns/widgets">
<!ENTITY pass "pass&amp;.html">
]>
<xyz:widget xmlns:xyz="&widgets-ns;">
  <xyz:content src="&pass;"/>
  <xyz:name>bv</xyz:name>
</xyz:widget>

For the above example, I would like to get, using Python, the equivalent XML:

<widget xmlns="http://www.w3.org/ns/widgets">
  <content src="pass&amp;.html"/>
  <name>bv</name>
</widget>

or to get a DOM whose documentElement is widget, with content and name as its childNodes, an xmlns attribute on widget with the value http://www.w3.org/ns/widgets, and so on.

I may not have used the correct terminology, but I hope the examples above make clear what I mean.

Upvotes: 3

Views: 912

Answers (1)

Fred Foo

Reputation: 363607

lxml handles this just fine; it expands the entities declared in the internal DTD subset:

>>> from lxml import etree
>>> s = """<?xml version="1.0" standalone="yes" ?>
... <!DOCTYPE widget [
... <!ENTITY widgets-ns "http://www.w3.org/ns/widgets">
... <!ENTITY pass "pass&amp;.html">
... ]>
... <xyz:widget xmlns:xyz="&widgets-ns;">
...   <xyz:content src="&pass;"/>
...   <xyz:name>bv</xyz:name>
... </xyz:widget>
... """
>>> etree.fromstring(s)
<Element {http://www.w3.org/ns/widgets}widget at 7f4de2cc58e8>
>>> etree.fromstring(s).xpath("//xyz:content/@src",
...                           namespaces={"xyz": "http://www.w3.org/ns/widgets"})
['pass&.html']
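
For comparison, here is a minimal sketch using only the standard library, in case installing lxml is not an option. As far as I can tell, the expat-based parsers behind xml.etree.ElementTree and xml.dom.minidom also expand entities declared in the internal DTD subset, so the xyz:widget you see in tagName is just the qualified name (prefix included); the expanded namespace is available through namespaceURI and localName. The helper names summarize_with_elementtree and summarize_with_minidom are made up for this illustration, and behaviour may differ between Python versions, so treat it as something to verify rather than a guarantee.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Same document as in the question.
XML = """<?xml version="1.0" standalone="yes" ?>
<!DOCTYPE widget [
<!ENTITY widgets-ns "http://www.w3.org/ns/widgets">
<!ENTITY pass "pass&amp;.html">
]>
<xyz:widget xmlns:xyz="&widgets-ns;">
  <xyz:content src="&pass;"/>
  <xyz:name>bv</xyz:name>
</xyz:widget>
"""

NS = "http://www.w3.org/ns/widgets"


def summarize_with_elementtree(text):
    """Return (root tag, src attribute) as seen by ElementTree."""
    root = ET.fromstring(text)
    # ElementTree spells namespaced names as "{uri}localname".
    content = root.find("{%s}content" % NS)
    return root.tag, content.get("src")


def summarize_with_minidom(text):
    """Return (tagName, localName, namespaceURI, src attribute) as seen by minidom."""
    doc = minidom.parseString(text)
    root = doc.documentElement
    # Look the child up by namespace URI rather than by prefix.
    content = root.getElementsByTagNameNS(NS, "content")[0]
    return root.tagName, root.localName, root.namespaceURI, content.getAttribute("src")


if __name__ == "__main__":
    print(summarize_with_elementtree(XML))
    print(summarize_with_minidom(XML))
```

Matching on the namespace URI instead of the xyz prefix means the lookup does not depend on whichever prefix the document author happened to choose.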

Upvotes: 6
