Reputation: 1443
Assume I have a document which uses Unicode in tag names, for example <año>2012</año>.
When I use etree from lxml to parse such a document, I have no problems; the tree is built correctly. But when (for debugging purposes) I try to print some elements, I get an exception about a failed attempt to encode a Unicode character as ASCII.
This is not a problem of terminal configuration or bad file encoding, since I can print the name of the node (.tag), which contains the same Unicode character, without any problem. Apparently the problem is caused by the "stringification" of the Element object, which assumes that tag names are always plain ASCII.
The following code shows the problem (and also shows that it is not a file/terminal/encoding problem).
# coding: utf-8
from lxml import etree
doc = """<?xml version="1.0" encoding="utf-8"?>
<año>2012</año>
"""
x = etree.fromstring(doc) # No problem
print x.tag # No problem
print x # Exception
Running the above script in a terminal with a properly defined LC_CTYPE produces the following output:
año
Traceback (most recent call last):
File "procesar.py", line 8, in <module>
print x
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 10: ordinal not in range(128)
Note how print x.tag correctly outputs año. Shouldn't print x produce something like <Element año at b7d26eb4>?
Is this a known problem? Any ideas about workarounds?
Upvotes: 2
Views: 1414
Reputation: 12946
You have to encode unicode strings into byte strings before output.
Try:
print unicode(x).encode('utf8')
Quoting the documentation for the unicode function:
For objects which provide a __unicode__() method, it will call this method without arguments to create a Unicode string. For all other objects, the 8-bit string version or representation is requested and then converted to a Unicode string using the codec for the default encoding in 'strict' mode.
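To make the failure mode concrete, here is a minimal sketch that needs no lxml at all. It assumes the element's text representation looks like the string below (a hand-written stand-in, not output produced by lxml), and the helper name to_ascii_or_none is purely illustrative. It only shows why the implicit ASCII encoding fails and why an explicit UTF-8 encoding does not:

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-in for the element's text representation; the real
# string and address would come from lxml, this one is written by hand.
rep = u"<Element a\xf1o at 0xb7d26eb4>"


def to_ascii_or_none(text):
    """Return text encoded as ASCII, or None if it cannot be encoded."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        return None


# The 'ñ' (u'\xf1') sits at index 10, the position named in the traceback.
position = rep.index(u"\xf1")

# Encoding to ASCII fails, because u'\xf1' is outside range(128).
ascii_bytes = to_ascii_or_none(rep)

# Encoding explicitly to UTF-8 succeeds; 'ñ' becomes the two bytes C3 B1.
utf8_bytes = rep.encode("utf-8")
```

If the stand-in string matches what lxml really produces, this is the same thing that the explicit .encode('utf8') call in the suggestion above is doing: it replaces the implicit ASCII step with an encoding that can represent the tag name.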
Upvotes: 4