Reputation: 125
I have zip archives containing many UTF-8 xml files. These files have mostly English tags and text, but a few tags contain non-English text. I have no problem opening the zip file and parsing the xml files inside it, but the non-English text loses its encoding.
When an xml file is extracted and opened in Notepad++ the non-English text looks like:
Курс карбованца к доллару не изменился на Украинской Межбанковской Валютной Бирже (УМВБ) - 176.100.
When it is extracted and read in Python (on a linux box) the text looks like:
ÐÑÑÑ ÐºÐ°ÑбованÑа к доллаÑÑ Ð½Ðµ изменилÑÑ Ð½Ð° УкÑаинÑкой ÐежбанковÑкой ÐалÑÑной ÐиÑже (УÐÐÐ) - 176.100.
My code looks like:
def parse(self, fp):
    # open/decompress zip file
    with zipfile.ZipFile(fp, 'r') as f:
        # get all files in zip
        comp_files = f.namelist()
        for comp_file in comp_files:
            cfp = f.open(comp_file, 'r')
            # parse xml
            tree = ElementTree.parse(cfp)
            ...parsing...
I have tried decoding/encoding the text from cfp, and wrapping cfp with codecs.EncodedFile() using input encodings of utf_8 and utf_8_sig, with no change. What can I do to fix the non-English text?
Upvotes: 1
Views: 4002
Reputation: 177901
The result you are seeing is UTF-8 incorrectly decoded as latin-1/iso-8859-1:
>>> x=u'Курс карбованца к доллару не изменился на Украинской Межбанковской Валютной Бирже (УМВБ) - 176.100.'
>>> print x.encode('utf8').decode('latin1')
ÐÑÑÑ ÐºÐ°ÑбованÑа к доллаÑÑ Ð½Ðµ изменилÑÑ Ð½Ð° УкÑаинÑкой ÐежбанковÑкой ÐалÑÑной ÐиÑже (УÐÐÐ) - 176.100.
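To make the diagnosis concrete, here is a minimal sketch of the same round trip written for Python 3 (the session above is Python 2). It assumes the garbled string still holds every original byte, which is true for a latin-1 mis-decode because latin-1 maps all 256 byte values to code points. The variable names are only illustrative.

```python
# Python 3 sketch: reproduce the mis-decode and then reverse it.
original = "Курс карбованца к доллару"

# UTF-8 bytes wrongly interpreted as latin-1 give the garbled text.
garbled = original.encode("utf-8").decode("latin-1")

# Reversing the wrong step recovers the text, because latin-1
# round-trips every byte value without loss.
repaired = garbled.encode("latin-1").decode("utf-8")

print(ascii(garbled[:4]))
print(repaired == original)
```

If a repair like this is ever needed in real code, it usually means some earlier step decoded bytes with the wrong codec, and fixing that step is preferable to patching the strings afterwards.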
I saved the following text in Notepad++ as a single file, encoded as UTF-8 without BOM, and put it in a zip file:
<text>Курс карбованца к доллару не изменился на Украинской Межбанковской Валютной Бирже (УМВБ) - 176.100.</text>
Your code with modifications to make it runnable:
from xml.etree import ElementTree
import zipfile

def parse(fp):
    # open/decompress zip file
    with zipfile.ZipFile(fp, 'r') as f:
        # get all files in zip
        comp_files = f.namelist()
        for comp_file in comp_files:
            cfp = f.open(comp_file, 'r')
            # parse xml
            tree = ElementTree.parse(cfp)
            print tree.getroot().text
            print type(tree.getroot().text)

# open the archive in binary mode so the zip data is read unmodified
parse(open('file.zip', 'rb'))
The result:
Курс карбованца к доллару не изменился на Украинской Межбанковской Валютной Бирже (УМВБ) - 176.100.
<type 'unicode'>
So it looks to me that it is just being displayed incorrectly on your Linux box, but without an actual sample of the files you are working with, it is difficult to analyze further.
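One way to tell a data problem from a display problem is to look at the code points of the parsed text rather than at how the terminal renders it. The following is a rough Python 3 sketch under stated assumptions: it builds a small zip in memory instead of using your files, and `build_zip`, `read_texts` and `sample.xml` are hypothetical names introduced only for this illustration.

```python
import io
import sys
import zipfile
from xml.etree import ElementTree

SAMPLE = "Курс карбованца к доллару"

def build_zip(text):
    """Return an in-memory zip holding one UTF-8 XML file (no BOM)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        xml_bytes = ("<text>%s</text>" % text).encode("utf-8")
        zf.writestr("sample.xml", xml_bytes)  # hypothetical member name
    buf.seek(0)
    return buf

def read_texts(fp):
    """Parse every member of the zip and collect the root element text."""
    texts = []
    with zipfile.ZipFile(fp, "r") as zf:
        for name in zf.namelist():
            with zf.open(name) as member:
                # The parser decodes the bytes itself; XML without an
                # encoding declaration is treated as UTF-8.
                texts.append(ElementTree.parse(member).getroot().text)
    return texts

texts = read_texts(build_zip(SAMPLE))

# ascii() shows the code points regardless of the terminal encoding.
print(ascii(texts[0]))
# The encoding Python will use when printing to this terminal.
print(sys.stdout.encoding)
```

If the escaped output shows Cyrillic code points (the `\u04xx` range) while a plain print still looks garbled, the parsed data is fine and the terminal or locale settings are the likely culprit.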
Upvotes: 5