UnicodeEncodeError: how to encode xml tree parsed with ElementTree

Question

I have an XML file with this structure:


 
  Title
  bla bla bla bla
 
 
  Title
  bla bla bla bla
 
 ...

I read the file in Python through the nltk package and parse the tree with ElementTree like this:

import nltk
from xml.etree.ElementTree import ElementTree

wow = nltk.data.find('/path/file.xml')
tree = ElementTree().parse(wow)

Then I try to print something from the 'two' elements like this:

for i, content in enumerate(tree.findall('content')):
    for two in content.findall('two'):
        if 'keyword' in str(two.text):
            print("%s" % (two.text))

And I get the infamous error:

Traceback (most recent call last):
   File "", line 3, in 
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 21: ordinal not in range(128)

I know this is due to a mismatch between the ASCII and UTF-8 encodings; the XML file itself is encoded in UTF-8. I tried several solutions found here on Stack Overflow (mainly adding .encode('UTF-8') or .decode('UTF-8') here and there, or passing encoding='utf-8' to data.find), but the examples I found were quite different from mine, so I didn't manage to adapt those answers to my case. As you can imagine, I am new to Python.

How can I avoid the error and print the content I need? Thanks.

Alex Martelli · Accepted Answer

So two.text should be a Unicode string and you want to print it -- why not just check

if u'keyword' in two.text:

and then if appropriate

print(two.text)

without the laborious stringification? If your terminal is properly set, it will tell Python which encoding to use to send it bytes properly representing that string for display purposes.
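For context: in Python 2, str() applied to a Unicode string implicitly encodes it with the ASCII codec, which is what raises the error as soon as the text contains a character such as u'\xe0'. A minimal sketch of the check without that round-trip follows; only the element names 'content' and 'two' come from the question, while 'root', 'one', and the sample text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: 'root', 'one' and the text are assumptions,
# only 'content' and 'two' appear in the question.
xml_bytes = (
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u'<root>'
    u'<content><one>Title</one><two>keyword: citt\xe0</two></content>'
    u'<content><one>Title</one><two>bla bla bla bla</two></content>'
    u'</root>'
).encode('utf-8')

root = ET.fromstring(xml_bytes)

matches = []
for content in root.findall('content'):
    for two in content.findall('two'):
        text = two.text or u''      # .text is None for an empty element
        if u'keyword' in text:      # Unicode-to-Unicode test, no str() call
            matches.append(text)
```

Each entry in matches is still a Unicode string, so print(text) on it relies on the terminal encoding exactly as described above.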

It's usually best to work uniformly in Unicode (that's why str has become unicode in Python 3:-) and only decode on input, encode on output -- and often the I/O systems will handle the decoding and encoding for you quite transparently.
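As a rough illustration of "decode on input, encode on output", the sketch below writes and reads a small text file through io.open, which does the encoding and decoding at the boundary. The file name and sample text are hypothetical, and the file lives in a temporary directory.

```python
import io
import os
import tempfile

text = u'citt\xe0'                      # Unicode inside the program

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')

# Output boundary: io.open encodes the Unicode string to UTF-8 bytes.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# What actually landed on disk: raw bytes.
with io.open(path, 'rb') as f:
    raw = f.read()

# Input boundary: io.open decodes the bytes back to Unicode.
with io.open(path, 'r', encoding='utf-8') as f:
    round_trip = f.read()

# The explicit equivalent of what io.open did on the way in.
decoded_by_hand = raw.decode('utf-8')
```

Between the two boundaries the program only ever handles Unicode, so no encode or decode calls are needed in the middle.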

Depending on your version of Python (which you don't tell us), you may need to do some explicit encoding -- as soon as possible, not late in the day. E.g., if you're stuck with Python 2, and wow is a Unicode string (depends on your version of nltk, I think), then

tree = ElementTree().parse(wow.encode('utf8'))

might work better; if wow is already a utf8-encoded byte string as it comes from nltk, then obviously you won't need to encode it again:-).
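Whichever form the path takes, the parser itself reads the file as bytes and decodes it according to the XML declaration, so the element text comes back as Unicode. A small sketch, assuming a throwaway UTF-8 file with a made-up 'root' element (the nltk lookup is left out on purpose):

```python
import io
import os
import tempfile
from xml.etree.ElementTree import ElementTree

# Hypothetical file standing in for '/path/file.xml'.
path = os.path.join(tempfile.mkdtemp(), 'file.xml')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'<?xml version="1.0" encoding="UTF-8"?>\n'
            u'<root><content><two>citt\xe0</two></content></root>')

# ElementTree().parse() returns the root element, which is why the
# question can call findall('content') directly on its result.
root = ElementTree().parse(path)

texts = [two.text for two in root.iter('two')]
```

Here texts holds decoded Unicode strings; no .decode('UTF-8') is needed on them afterwards.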

To remove such doubts, print(repr(wow[:30])) or thereabouts will tell you more. And print(sys.version) will tell you what version of Python so you can in turn tell us, as so few people appear to do even though it's most often absolutely crucial info!-)
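Those two checks could be bundled roughly as follows. The describe helper is hypothetical, not part of nltk or the standard library, and the path string merely stands in for whatever nltk.data.find returns on your installation.

```python
import sys


def describe(value, limit=40):
    """Return the type name and a truncated repr of value (hypothetical helper)."""
    return "%s: %s" % (type(value).__name__, repr(value)[:limit])


wow = u'/path/file.xml'     # stand-in for the value returned by nltk.data.find

print(sys.version)          # the interpreter version to include in a question
print(describe(wow))        # shows whether wow is text, bytes, or something else
```

Truncating the repr rather than slicing the value keeps the helper usable even if the object turns out not to be a string at all.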
