Reputation: 2958
I am trying to parse Medline XML documents using iterparse in the xml.etree.ElementTree module. Everything works except that some of the text includes non-ASCII characters, and I do not see a way of handling Unicode with findtext. Any suggestions?
Upvotes: 2
Views: 1295
Reputation: 2958
This was a very useful post in addition to the answer above.
Reading utf-8 characters from a gzip file in python
Upvotes: 0
Reputation: 52738
Have you tried opening the file with the UTF-8 encoding flag:
import xml.etree.ElementTree
fd = open('some.xml', mode='r', encoding='utf-8')  # Python 3; on Python 2 use io.open
xml.etree.ElementTree.iterparse(fd)
Or use decode:
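from io import StringIO
import xml.etree.ElementTree
fd = open('some.xml', mode='rb')  # binary mode, so read() returns bytes that can be decoded
sio = StringIO(fd.read().decode("utf-8"))
xml.etree.ElementTree.iterparse(sio)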
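For what it's worth, here is a minimal end-to-end sketch of iterparse plus findtext on text with non-ASCII characters, assuming Python 3. The sample document, the Record and Title element names, and the iter_titles helper are made up for illustration and are not the real Medline schema. It hands the parser raw UTF-8 bytes, which also works, because the parser decodes them itself and findtext returns ordinary str values.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for a Medline file; the element names
# are illustrative only. "\u00e9" is the non-ASCII character e-acute.
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<Records>'
    '<Record><Title>Caf\u00e9 au lait</Title></Record>'
    '<Record><Title>Plain ASCII title</Title></Record>'
    '</Records>'
).encode('utf-8')


def iter_titles(source):
    """Yield the Title text of each Record once the record is fully parsed."""
    for event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == 'Record':
            yield elem.findtext('Title')
            elem.clear()  # drop the finished record's children to save memory


# A BytesIO object stands in for a file opened in binary mode ('rb').
titles = list(iter_titles(io.BytesIO(SAMPLE)))
print(titles)
```

If the non-ASCII text comes back intact here but fails in your script, the problem is more likely in how the result is printed or written out than in findtext itself.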
Upvotes: 2