I have the XML file below. I am currently parsing it with minidom, and for this example the documentElement's tagName comes back as xyz:widget. That suggests to me that it ignores the !ENTITY definitions and thus the !DOCTYPE reference.
Which XML parser supports Document Type Definitions, so that the !ENTITY definitions and the !DOCTYPE reference are not ignored?
<?xml version="1.0" standalone="yes" ?>
<!DOCTYPE widget [
<!ENTITY widgets-ns "http://www.w3.org/ns/widgets">
<!ENTITY pass "pass&amp;.html">
]>
<xyz:widget xmlns:xyz="&widgets-ns;">
<xyz:content src="&pass;"/>
<xyz:name>bv</xyz:name>
</xyz:widget>
So that, for the above example, I can use Python to get the equivalent XML:
<widget xmlns="http://www.w3.org/ns/widgets">
<content src="pass&amp;.html"/>
<name>bv</name>
</widget>
or get a DOM whose documentElement is widget, with content and name as its childNodes, an xmlns attribute on widget with the value http://www.w3.org/ns/widgets, and so on.
I may not have used the correct terminology, but I hope the examples above make clear what I mean.
Upvotes: 3
Views: 912
lxml handles this just fine:
>>> from lxml import etree
>>> s = """<?xml version="1.0" standalone="yes" ?>
... <!DOCTYPE widget [
... <!ENTITY widgets-ns "http://www.w3.org/ns/widgets">
... <!ENTITY pass "pass&amp;.html">
... ]>
... <xyz:widget xmlns:xyz="&widgets-ns;">
... <xyz:content src="&pass;"/>
... <xyz:name>bv</xyz:name>
... </xyz:widget>
... """
>>> etree.fromstring(s)
<Element {http://www.w3.org/ns/widgets}widget at 7f4de2cc58e8>
>>> etree.fromstring(s).xpath("//xyz:content/@src",
... namespaces={"xyz": "http://www.w3.org/ns/widgets"})
['pass&.html']
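For comparison, here is a minimal sketch using only the standard library. It assumes that the expat-based parsers behind minidom and ElementTree expand entities declared in the internal DTD subset, and that the ampersand in the entity value is written as &amp;amp; so the DTD is well-formed. The names WIDGETS_NS, SOURCE, describe_root and with_default_namespace are made up for this illustration. The point is that tagName is only the qualified name (prefix included); the resolved values are in localName and namespaceURI.

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET

# Hypothetical constant for this sketch: the namespace the entity expands to.
WIDGETS_NS = "http://www.w3.org/ns/widgets"

# The document from the question, with the ampersand inside the
# entity value escaped as &amp; (an assumption about the real file).
SOURCE = """<?xml version="1.0" standalone="yes" ?>
<!DOCTYPE widget [
<!ENTITY widgets-ns "http://www.w3.org/ns/widgets">
<!ENTITY pass "pass&amp;.html">
]>
<xyz:widget xmlns:xyz="&widgets-ns;">
<xyz:content src="&pass;"/>
<xyz:name>bv</xyz:name>
</xyz:widget>
"""


def describe_root(xml_text):
    """Return (tagName, localName, namespaceURI, src) as seen by minidom."""
    doc = xml.dom.minidom.parseString(xml_text)
    root = doc.documentElement
    # Look the child up by namespace URI and local name, not by prefix.
    content = root.getElementsByTagNameNS(WIDGETS_NS, "content")[0]
    return (root.tagName, root.localName, root.namespaceURI,
            content.getAttribute("src"))


def with_default_namespace(xml_text):
    """Re-serialize the document with the widgets namespace as the default."""
    # Note: register_namespace changes process-wide ElementTree state.
    ET.register_namespace("", WIDGETS_NS)
    root = ET.fromstring(xml_text)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(describe_root(SOURCE))
    print(with_default_namespace(SOURCE))
```

If those assumptions hold, the second function gives output in the shape asked for in the question (a widget element carrying a plain xmlns attribute), without needing a different parser.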
Upvotes: 6