Reputation: 6428
I've got an ElementTree instance I'm trying to output to text using the tostring method:
tostring(root, encoding='UTF-8')
I get a UnicodeDecodeError (traceback below) because one of the Element.text nodes has the u'\u2014' character. I set the text property as follows:
my_str = u'\u2014'
el.text = my_str.encode('UTF-8')
How can I successfully serialize the tree to text? Am I encoding the nodes incorrectly? Thanks.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "crisis_app/converters/to_xml.py", line 129, in convert
return tostring(root, encoding='UTF-8')
File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1127, in tostring
ElementTree(element).write(file, encoding, method=method)
File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 821, in write
serialize(write, self._root, encoding, qnames, namespaces)
File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 940, in _serialize_xml
_serialize_xml(write, e, encoding, qnames, None)
File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 940, in _serialize_xml
_serialize_xml(write, e, encoding, qnames, None)
File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 940, in _serialize_xml
_serialize_xml(write, e, encoding, qnames, None)
File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 938, in _serialize_xml
write(_escape_cdata(text, encoding))
File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1074, in _escape_cdata
return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 288: ordinal not in range(128)
Upvotes: 2
Views: 3718
Reputation: 69042
If you do this:
my_str = u'\u2014'
el.text = my_str.encode('UTF-8')
you're setting the text to a UTF-8 encoded version of the unicode character. It's the same as:
el.text = '\xe2\x80\x94'
Now you don't have a unicode character anymore, but a series of bytes.
If you then do:
tostring(root, encoding='UTF-8')
You're saying you want the content encoded as UTF-8. To do that, the string first has to be decoded to unicode using the default encoding (ASCII) and then encoded as UTF-8. The decode step fails because the bytes in the string aren't in the ASCII range.
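The failing step can be reproduced without ElementTree. This is a minimal sketch, assuming only the behavior described above: Python 2 performs the ASCII decode implicitly when .encode() is called on a str, so the decode is written out explicitly here, which also lets the snippet run on Python 3. The names raw, error and text are illustrative.

```python
# UTF-8 bytes for U+2014, i.e. what el.text held in the question.
raw = u'\u2014'.encode('utf-8')

# Python 2 does this decode implicitly when .encode() is called on a byte
# string; it is spelled out here so the sketch also runs on Python 3.
try:
    raw.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc  # 0xe2 is outside the ASCII range, so this branch runs

# Decoding with the encoding the bytes were actually written in succeeds.
text = raw.decode('utf-8')
```

The first byte, 0xe2, is the same one named in the traceback; only its position differs because the real document is longer.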
ElementTree is perfectly capable of working with unicode, so just give it unicode instead of str:
>>> from xml.etree import ElementTree as et
>>> e = et.Element('test')
>>> e.text = u'\u2014'
>>> s = et.tostring(e)
>>> print s, repr(s)
<test>&#8212;</test> '<test>&#8212;</test>'
>>> s = et.tostring(e, encoding='utf-8')
>>> print s, repr(s)
<test>—</test> '<test>\xe2\x80\x94</test>'
Upvotes: 2