tags with unicode in names, and lxml

Question

Assume I have a document that uses Unicode in tag names, for example <año>2012</año>.

When I parse such a document with etree from lxml, I have no problems: the tree is built correctly. But when I try to print some elements (for debugging purposes), I get an exception about a failed attempt to encode a Unicode character as ASCII.

This is not a problem of terminal configuration or of a badly encoded file, since I can print the name of the node (.tag), which contains the same Unicode character, without any problem. Apparently the problem is caused by the "stringification" of the Element object, which assumes that tag names are always plain ASCII.

The following code shows the problem (and also shows that it is not a file/terminal/encoding problem).

# coding: utf-8
from lxml import etree
doc = """
<año>2012</año>
"""
x = etree.fromstring(doc)   # No problem
print x.tag                 # No problem
print x                     # Exception

Running the above script in a terminal with a properly defined LC_CTYPE produces the following output:

año
Traceback (most recent call last):
  File "procesar.py", line 8, in <module>
    print x
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 10: ordinal not in range(128)

Note how print x.tag correctly outputs año. Shouldn't print x produce something like <Element año at 0x...>?
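Position 10 in the traceback would fit a text form such as <Element año at 0x...>, where the ñ is the eleventh character. The following is only a minimal sketch of that reasoning: the string is a hand-written stand-in (the address is elided and the exact format is an assumption), not output taken from lxml, and the variable names are made up for illustration.

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-in for the text that "print x" tries to emit.
# The format and the elided address are assumptions, not real lxml output.
text = u"<Element a\xf1o at 0x...>"

try:
    # Roughly what an implicit conversion with the ASCII default codec does.
    text.encode("ascii")
    failure = None
except UnicodeEncodeError as exc:
    failure = exc

# Index of the first character that could not be encoded (the n-tilde).
bad_position = failure.start if failure is not None else None

# Encoding with an explicit codec that covers the character succeeds.
encoded = text.encode("utf-8")
```

If the assumption about the format holds, bad_position matches the position shown in the traceback, which suggests the failure happens while the element's own text form is converted to bytes, not while the tag name is read.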

Is this a known problem? Any ideas about workarounds?

KurzedMetal · Accepted Answer

You have to encode Unicode strings into byte strings before output.

Try:

print unicode(x).encode('utf8')

Quoting the documentation of the unicode function:

For objects which provide a __unicode__() method, it will call this method without arguments to create a Unicode string. For all other objects, the 8-bit string version or representation is requested and then converted to a Unicode string using the codec for the default encoding in 'strict' mode.
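The same idea can be wrapped in a small helper: build the text form first, then encode it explicitly. This is a sketch under stated assumptions, not lxml code. FakeElement, to_utf8 and text_type are hypothetical names, FakeElement only imitates an element whose text form contains a non-ASCII tag name, and the version check is there so the snippet also runs on Python 3, where str is already the text type.

```python
# -*- coding: utf-8 -*-
import sys

# Python 2 keeps text in a separate unicode type; on Python 3, str is text.
text_type = unicode if sys.version_info[0] == 2 else str  # noqa: F821


class FakeElement(object):
    """Hypothetical stand-in for a parsed element; this is not lxml's class."""

    def __init__(self, tag):
        self.tag = tag

    def __unicode__(self):
        # Called by unicode(obj) on Python 2.
        return u"<Element %s at 0x...>" % self.tag

    def __str__(self):
        # Called by str(obj); on Python 3 this is the text form.
        return self.__unicode__()


def to_utf8(obj):
    """Hypothetical helper: get the text form, then encode it explicitly."""
    return text_type(obj).encode("utf-8")


element_bytes = to_utf8(FakeElement(u"a\xf1o"))
```

On Python 2 the resulting byte string can be passed to print directly. UTF-8 is an assumption here; a terminal configured with another encoding would need that codec instead.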
