aaronstacy

Reputation: 6428

Python ElementTree decoding error

I've got an ElementTree instance I'm trying to output to text using the tostring method:

tostring(root, encoding='UTF-8')

I get a UnicodeDecodeError (traceback below) because one of the Element.text nodes has the u'\u2014' character. I set the text property as follows:

my_str = u'\u2014'
el.text = my_str.encode('UTF-8')

How can I successfully serialize the tree to text? Am I encoding the nodes incorrectly? Thanks.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "crisis_app/converters/to_xml.py", line 129, in convert
    return tostring(root, encoding='UTF-8')
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1127, in tostring
    ElementTree(element).write(file, encoding, method=method)
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 821, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 940, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 940, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 940, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 938, in _serialize_xml
    write(_escape_cdata(text, encoding))
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1074, in _escape_cdata
    return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 288: ordinal not in range(128)

Upvotes: 2

Views: 3718

Answers (1)

mata

Reputation: 69042

If you do this:

my_str = u'\u2014'
el.text = my_str.encode('UTF-8')

you're setting the text to a UTF-8 encoded version of the unicode character. It's the same as

el.text = '\xe2\x80\x94'

Now you don't have a unicode character anymore, but a series of bytes.

If you then do:

tostring(root, encoding='UTF-8')

You're saying you want the content encoded as UTF-8. To do that, the string first has to be decoded to unicode using the default encoding (ASCII) and then encoded as UTF-8. That decoding step fails because the bytes in the string aren't in the ASCII range, which is why you see a UnicodeDecodeError even though the call in the traceback is encode.
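To see that hidden decode step in isolation, here is a minimal sketch that spells it out by hand. It doesn't touch ElementTree at all, and the names raw, failed and message are just illustrative. It assumes Python 2 falls back to ASCII for the implicit decode, as described above; writing the decode explicitly makes the same failure reproducible on Python 3 too.

```python
# The UTF-8 encoding of the em dash is three bytes, none of them ASCII.
raw = u'\u2014'.encode('utf-8')

try:
    # Spelled out: what Python 2 effectively does when .encode() is
    # called on a byte string (decode with ASCII first, then encode).
    raw.decode('ascii').encode('utf-8')
    failed = False
    message = ''
except UnicodeDecodeError as exc:
    failed = True
    message = str(exc)

print(repr(raw))
print(message)
```

The error message names byte 0xe2, the first byte of the encoded character, just like the last line of the traceback in the question.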

ElementTree is perfectly capable of working with unicode, so just give it unicode instead of str:

>>> from xml.etree import ElementTree as et
>>> e = et.Element('test')
>>> e.text = u'\u2014'

>>> s = et.tostring(e)
>>> print s, repr(s)
<test>&#8212;</test> '<test>&#8212;</test>'

>>> s = et.tostring(e, encoding='utf-8')
>>> print s, repr(s)
<test>—</test> '<test>\xe2\x80\x94</test>'
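If some of your input already arrives as encoded bytes, one option is to decode it back to unicode before assigning it to the element. This is only a sketch: to_text is a hypothetical helper, not part of ElementTree, and it assumes the bytes really are UTF-8.

```python
from xml.etree import ElementTree as et


def to_text(value, encoding='utf-8'):
    """Return value as a unicode string, decoding byte strings first.

    Hypothetical helper; assumes byte input is encoded with `encoding`.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


el = et.Element('test')
# Simulate input that was encoded too early, then undo it before storing.
el.text = to_text(u'\u2014'.encode('utf-8'))

# The tree now holds unicode, so serializing as UTF-8 no longer fails.
serialized = et.tostring(el, encoding='utf-8')
print(repr(serialized))
```

That way the tree only ever holds unicode, and the encoding happens once, in tostring.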

Upvotes: 2
