Reputation: 7793
I noticed that the XML entity &quot; is automatically converted to its literal character:
>>> from lxml import etree as et
>>> parser = et.XMLParser()
>>> xml = et.fromstring("<root><elem>&quot;hello world&quot;</elem></root>", parser)
>>> print et.tostring(xml, pretty_print=1)
<root>
<elem>"hello world"</elem>
</root>
>>>
I found one related old thread (2009-02-07):
s = cStringIO.StringIO("""&quot;She's the MAN!&quot;""")
e = etree.parse(s,etree.XMLParser(resolve_entities=False))
Note that there's also etree.fromstring().
etree.tostring(e) '"She\'s the MAN!"'
I would have expected resolve_entities=False to have prevented the translation of, e.g., &quot; to ".
The "resolve_entities" option is meant for entities defined in a DTD of which you want to keep the reference instead of the resolved value. The entities you mention are part of the XML spec, not of a DTD.
is there another way to prevent this behavior (or, if nothing else, reverse it after the fact)?
Well, what you get is well-formed XML. May I ask why you need the entity references in the output?
Still, the only response asks why you would want to do that; there is no direct answer to the problem. I'm quite surprised that the etree parser forces the conversion without offering an option to disable it.
The following example shows why I need a solution. This XML is for the XBMC skinning parser:
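The point made in the thread can be checked without lxml. This is a minimal sketch, assuming Python 3 and the standard library's xml.etree.ElementTree (the names a, b and out are just for illustration): once parsed, text written as &quot; and text written as a literal quote are the same string, so the serializer has nothing to tell them apart.

```python
import xml.etree.ElementTree as ET

# Same content, once written with entities and once with literal quotes.
a = ET.fromstring('<elem>&quot;hello world&quot;</elem>')
b = ET.fromstring('<elem>"hello world"</elem>')

# The tree only stores the decoded characters.
print(a.text == b.text)  # True

# Serializing escapes &, < and > in text, but not quotes.
out = ET.tostring(a, encoding="unicode")
print(out)  # <elem>"hello world"</elem>
```

So any fix has to happen at serialization time (or afterwards), not in the parser.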
>>> print open("/tmp/so.xml").read() #the original file
<window id="1234">
<defaultcontrol>101</defaultcontrol>
<controls>
<control type="button" id="101">
<onfocus>Dialog.Close(212)</onfocus>
<onfocus>SetFocus(11)</onfocus>
</control>
<control type="button" id="102">
<visible>StringCompare(VideoPlayer.PlotOutline,Stream.IsPlaying) + !Skin.HasSetting(Stream.IsUpdated)</visible>
<onfocus>RunScript(script.test)</onfocus>
<onfocus>SetFocus(11)</onfocus>
</control>
<control type="button" id="103">
<visible>SubString(VideoPlayer.PlotOutline,Video.IsPlaying)</visible>
<onfocus>Close</onfocus>
<onfocus>RunScript(&quot;/.xbmc/addons/script.hello.world/default.py&quot;,&quot;$INFO[VideoPlayer.Album]&quot;,&quot;$INFO[VideoPlayer.Genre]&quot;)</onfocus>
</control>
</controls>
</window>
>>> root = et.parse("/tmp/so.xml", parser)
>>> r = root.getroot()
>>> for c in r:
...     for cc in c:
...         if cc.attrib.get('id') == "103":
...             cc.remove(cc[1])  # remove one element, just as a demonstration
...
>>> o = open("/tmp/so.xml", "w")
>>> o.write(et.tostring(r, pretty_print=1)) #save it back
>>> o.close()
>>> print open("/tmp/so.xml").read() #the file after the change
<window id="1234">
<defaultcontrol>101</defaultcontrol>
<controls>
<control type="button" id="101">
<onfocus>Dialog.Close(212)</onfocus>
<onfocus>SetFocus(11)</onfocus>
</control>
<control type="button" id="102">
<visible>StringCompare(VideoPlayer.PlotOutline,Stream.IsPlaying) + !Skin.HasSetting(Stream.IsUpdated)</visible>
<onfocus>RunScript(script.test)</onfocus>
<onfocus>SetFocus(11)</onfocus>
</control>
<control type="button" id="103">
<visible>SubString(VideoPlayer.PlotOutline,Video.IsPlaying)</visible>
<onfocus>RunScript("/.xbmc/addons/script.hello.world/default.py","$INFO[VideoPlayer.Album]","$INFO[VideoPlayer.Genre]")</onfocus>
</control>
</controls>
</window>
>>>
As you can see in the last onfocus element under id "103", the &quot; entities are no longer in their original form. This leads to a bug if the $INFO[VideoPlayer.Album] variable contains nested quotes: the argument becomes ""test"", which is invalid and causes an error.
So is there any way, even a hacky one, to keep &quot; in its original form?
[UPDATE]: For anyone interested, the other 3 predefined XML entities, i.e. gt, lt and amp, only get converted when using method="html" inside a script tag. lxml and xml.etree.ElementTree, on both Python 2 and Python 3, share this behavior, which is confusing:
>>> from lxml import etree as et
>>> r = et.fromstring("<root><script>&quot;&apos;&amp;&gt;&lt;</script><p>&quot;&apos;&amp;&gt;&lt;</p></root>")
>>> print et.tostring(r, pretty_print=1, method="xml")
<root>
<script>"'&amp;&gt;&lt;</script>
<p>"'&amp;&gt;&lt;</p>
</root>
>>> print et.tostring(r, pretty_print=1, method="html")
<root><script>"'&><</script><p>"'&amp;&gt;&lt;</p></root>
>>>
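For comparison, here is a small sketch of the same check using only xml.etree.ElementTree from the standard library. It assumes Python 3 (encoding="unicode" makes tostring return a str); the variable names are mine.

```python
import xml.etree.ElementTree as ET

# All five predefined entities, once inside <script> and once inside <p>.
src = ("<root><script>&quot;&apos;&amp;&gt;&lt;</script>"
       "<p>&quot;&apos;&amp;&gt;&lt;</p></root>")
root = ET.fromstring(src)

as_xml = ET.tostring(root, encoding="unicode", method="xml")
as_html = ET.tostring(root, encoding="unicode", method="html")

print(as_xml)   # quotes are literal; &, > and < stay escaped everywhere
print(as_html)  # only the <script> content is written out raw
```

With method="xml" both elements keep &amp;, &gt; and &lt;. With method="html" the script content is emitted unescaped, while the p content is still escaped.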
[UPDATE2]: The following is the list of HTML tags I tested, taken from html5lib:
#https://github.com/html5lib/html5lib-python/blob/master/html5lib/sanitizer.py
acceptable_elements = ['a', 'abbr', 'acronym', 'address', 'area',
'article', 'aside', 'audio', 'b', 'big', 'blockquote', 'br', 'button',
'canvas', 'caption', 'center', 'cite', 'code', 'col', 'colgroup',
'command', 'datagrid', 'datalist', 'dd', 'del', 'details', 'dfn',
'dialog', 'dir', 'div', 'dl', 'dt', 'em', 'event-source', 'fieldset',
'figcaption', 'figure', 'footer', 'font', 'form', 'header', 'h1',
'h2', 'h3', 'h4', 'h5', 'h6', 'hr', 'i', 'img', 'input', 'ins',
'keygen', 'kbd', 'label', 'legend', 'li', 'm', 'map', 'menu', 'meter',
'multicol', 'nav', 'nextid', 'ol', 'output', 'optgroup', 'option',
'p', 'pre', 'progress', 'q', 's', 'samp', 'section', 'select',
'small', 'sound', 'source', 'spacer', 'span', 'strike', 'strong',
'sub', 'sup', 'table', 'tbody', 'td', 'textarea', 'time', 'tfoot',
'th', 'thead', 'tr', 'tt', 'u', 'ul', 'var', 'video']
from lxml import etree as et
for e in acceptable_elements:
    r = et.fromstring(e.join(["<", ">hello&amp;world</", ">"]))
    s = et.tostring(r, pretty_print=1, method="html")
    closed_tag = "</" + e + ">"
    if closed_tag not in s:  # closing tag is missing from the HTML output
        print s
Running this code produces the following output:
<area>
<br>
<col>
<hr>
<img>
<input>
As you can see, only the opening tag is printed and the rest disappears into a black hole. I tested all 5 XML entities and they all behave the same way, which is confusing. This did not happen when using HTMLParser, so I guess there is a bug somewhere between the fromstring step (whose parser defaults to XML) and the tostring(method="html") step. I also found that it has nothing to do with entities, because "<img>hello</img>" (without entities) is truncated to <img> too. The hello text is not lost from the tree, though: it reappears whenever method="xml" is used to print it.
Upvotes: 4
Views: 3376
Reputation: 2135
from xml.sax.saxutils import escape
from lxml import etree

def to_string(xdoc):
    """Serialize xdoc by hand, writing quotes in text as &apos; / &quot;."""
    r = ""
    for action, elem in etree.iterwalk(xdoc, events=("start", "end")):
        if action == 'start':
            # escape() handles &, < and >; the dict adds both quote characters
            text = escape(elem.text, {"'": "&apos;", "\"": "&quot;"}) if elem.text is not None else ""
            attrs = "".join([' %s="%s"' % (k, v) for k, v in elem.attrib.items()])
            r += "<%s%s>%s" % (elem.tag, attrs, text)
        elif action == 'end':
            r += "</%s>%s" % (elem.tag, elem.tail if elem.tail else "\n")
    return r

xdoc = etree.fromstring(xml_text)  # xml_text: the XML source as a string
s = to_string(xdoc)
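If lxml is not available, the same idea can be sketched with the standard library. This is a variation on the code above, not the same code: it assumes Python 3, uses xml.etree.ElementTree instead of iterwalk, and the helper name serialize and the shortened sample string are made up for illustration. Note that it escapes every quote in text, since the tree cannot tell which ones were written as entities.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Extra replacements applied by escape() on top of &, < and >.
QUOTES = {'"': "&quot;", "'": "&apos;"}

def serialize(elem):
    """Hand-rolled serializer that writes quotes in text as entities."""
    attrs = "".join(" %s=%s" % (k, quoteattr(v)) for k, v in elem.items())
    text = escape(elem.text or "", QUOTES)
    children = "".join(serialize(child) for child in elem)
    tail = escape(elem.tail or "", QUOTES)
    return "<%s%s>%s%s</%s>%s" % (elem.tag, attrs, text, children, elem.tag, tail)

# Shortened, hypothetical version of the skin file from the question.
src = ('<control id="103"><onfocus>RunScript(&quot;a.py&quot;,'
       '&quot;$INFO[X]&quot;)</onfocus></control>')
out = serialize(ET.fromstring(src))
print(out == src)  # True: the &quot; entities survive the round trip
```

Unlike the version above, this does not add newlines after closing tags and it escapes attribute values via quoteattr; it ignores namespaces, comments and processing instructions.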
Upvotes: 3