Reputation: 413
I have an XML file with this structure:
<doc>
<content>
<one>Title</one>
<two>bla bla bla bla</two>
</content>
<content>
<one>Title</one>
<two>bla bla bla bla</two>
</content>
...
</doc>
I read the file in Python through the nltk package and parse the tree with ElementTree like this:
import nltk
from xml.etree.ElementTree import ElementTree
wow = nltk.data.find('/path/file.xml')
tree = ElementTree().parse(wow)
Then I try to print something from 'two' elements like this:
for i, content in enumerate(tree.findall('content')):
    for two in content.findall('two'):
        if 'keyword' in str(two.text):
            print("%s" % (two.text))
And I get the infamous error:
Traceback (most recent call last):
File "<stdin>", line 3, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 21: ordinal not in range(128)
I know this is due to a mismatch between the ASCII and UTF-8 encodings. The XML encoding is UTF-8. I tried several solutions found here on Stack Overflow (mainly adding .encode('UTF-8') or .decode('UTF-8') here and there, or adding encoding='utf-8' to data.find), but the examples I found were quite different from mine, so I didn't manage to adapt those answers to my case: as you can imagine, I am new to Python.
How can I avoid the error and print the content I need? Thanks.
Upvotes: 0
Views: 1372
Reputation: 881635
So two.text should be a Unicode string and you want to print it -- why not just check
if u'keyword' in two.text:
and then if appropriate
print(two.text)
without the laborious stringification? If your terminal is properly set, it will tell Python which encoding to use to send it bytes properly representing that string for display purposes.
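A minimal sketch of that direct check, assuming the structure from the question. The inline XML, the keyword, and the helper name find_matches are made up for illustration, and ET.fromstring stands in for reading the real file:

```python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET

# Hypothetical sample data mirroring the question's structure,
# encoded to UTF-8 bytes the way the file on disk would be.
XML_BYTES = (
    u"<doc>"
    u"<content><one>Title</one><two>keyword: voil\xe0</two></content>"
    u"<content><one>Title</one><two>bla bla bla bla</two></content>"
    u"</doc>"
).encode("utf-8")


def find_matches(root, keyword):
    """Return the text of every <two> element that contains keyword."""
    matches = []
    for content in root.findall("content"):
        for two in content.findall("two"):
            text = two.text or u""  # .text is None for an empty element
            if keyword in text:     # no str() call, so no implicit ASCII encode
                matches.append(text)
    return matches


root = ET.fromstring(XML_BYTES)
matches = find_matches(root, u"keyword")
for text in matches:
    print(text)
```

Whether the final print succeeds still depends on the terminal reporting an encoding that can represent the character, as noted above.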
It's usually best to work uniformly in Unicode (that's why str has become unicode in Python 3:-) and only decode on input, encode on output -- and often the I/O systems will handle the decoding and encoding for you quite transparently.
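As a rough illustration of "decode on input, encode on output", here is a sketch using io.open, which behaves the same way on Python 2 and 3. The temporary file and its contents are invented for the demonstration and have nothing to do with your XML file:

```python
import io
import os
import tempfile

# Hypothetical file path, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# Output boundary: io.open encodes the Unicode text to UTF-8 bytes.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"voil\xe0 keyword")

# Input boundary: io.open decodes the bytes back to Unicode text.
with io.open(path, "r", encoding="utf-8") as f:
    text = f.read()

# In between, everything is Unicode; no encode/decode calls are needed.
has_keyword = u"keyword" in text

# Reading in binary mode shows the bytes the encoding step produced.
with io.open(path, "rb") as f:
    raw = f.read()
```

The only places an encoding is named are the two open calls; the comparison in the middle never touches bytes.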
Depending on your version of Python (which you don't tell us), you may need to do some explicit encoding -- as soon as possible, not late in the day. E.g., if you're stuck with Python 2, and wow is a Unicode string (depends on your version of nltk, I think), then
tree = ElementTree().parse(wow.encode('utf8'))
might work better; if wow is already a utf8-encoded byte string as it comes from nltk, then obviously you won't need to encode it again:-).
To remove such doubts, print(repr(wow[:30])) or thereabouts will tell you more. And print(sys.version) will tell you what version of Python you're running, so you can in turn tell us, as so few people appear to do even though it's most often absolutely crucial info!-)
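Those two checks could be bundled roughly as follows. The describe helper and the two sample values are hypothetical stand-ins, since I can't know what nltk actually hands back on your setup:

```python
import sys


def describe(value, limit=30):
    """Return the type name and a truncated repr of value."""
    return "%s %s" % (type(value).__name__, repr(value[:limit]))


# Hypothetical stand-ins for what wow might turn out to be.
as_text = u"/path/file.xml"
as_bytes = as_text.encode("utf-8")

print(sys.version)
print(describe(as_text))
print(describe(as_bytes))
```

The type names printed differ between Python 2 and Python 3, which is exactly the information worth including in the question.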
Upvotes: 2