seandavi

Reputation: 2958

xml.etree.ElementTree and unicode findtext

I am trying to parse Medline XML documents using iterparse in the xml.etree.ElementTree module. Everything works well except that some of the text includes non-ASCII characters, and I do not see a way of handling Unicode when using findtext. Any suggestions?

Upvotes: 2

Views: 1295

Answers (2)

seandavi

Reputation: 2958

In addition to the other answer, this post was very useful:

Reading utf-8 characters from a gzip file in python
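For anyone landing here with gzip-compressed XML, here is a minimal sketch of the idea, not the linked post's exact code. It assumes the file is UTF-8 encoded XML; the file name and the Root/Name elements are made up for illustration. The point is that opening the gzip file in binary mode and handing it straight to iterparse lets the XML parser do the decoding itself.

```python
import gzip
import os
import tempfile
import xml.etree.ElementTree as ET

# Write a small gzip-compressed sample so the sketch is self-contained.
# The element names and the file name are hypothetical.
xml_bytes = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<Root><Name>Søren</Name></Root>'
).encode("utf-8")
path = os.path.join(tempfile.mkdtemp(), "sample.xml.gz")
with gzip.open(path, "wb") as out:
    out.write(xml_bytes)

# gzip.open in binary mode yields bytes; the XML parser decodes them
# according to the encoding named in the XML declaration.
with gzip.open(path, "rb") as fh:
    names = [elem.text for _, elem in ET.iterparse(fh) if elem.tag == "Name"]
```

After this runs, names holds ordinary Python 3 str values, so no separate decode step is needed on the text.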

Upvotes: 0

chown

Reputation: 52738

Have you tried opening the file with the utf-8 encoding flag:

import xml.etree.ElementTree
fd = open('some.xml', mode='r', encoding='utf-8')  # the encoding argument needs Python 3 (or io.open on Python 2)
xml.etree.ElementTree.iterparse(fd)

Or use decode:

from io import StringIO
import xml.etree.ElementTree
fd = open('some.xml', mode='rb')  # read bytes so that decode() is available
sio = StringIO(fd.read().decode("utf-8"))
xml.etree.ElementTree.iterparse(sio)
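To check that findtext copes with non-ASCII text, here is a small self-contained sketch. It assumes Python 3 and UTF-8 input; the sample document, its Article/Title/Author element names, and the iter_articles helper are hypothetical stand-ins, not the real Medline schema.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical Medline-like sample; the element names are illustrative only.
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<Articles>'
    '<Article><Title>Ménière disease</Title><Author>Müller</Author></Article>'
    '<Article><Title>β-blockers</Title><Author>García</Author></Article>'
    '</Articles>'
).encode("utf-8")


def iter_articles(source):
    """Yield (title, author) for each Article element in a binary XML stream."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Article":
            # On the "end" event the children are fully parsed,
            # so findtext can read their text.
            yield elem.findtext("Title"), elem.findtext("Author")
            elem.clear()  # drop the processed element's children to limit memory use


articles = list(iter_articles(io.BytesIO(SAMPLE)))
```

Here findtext returns plain str objects with the accented characters intact, because the parser decodes the bytes according to the XML declaration.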

Upvotes: 2
