Issues reading zip containing UTF-8 xml files

Question

I have zip archives containing many UTF-8 xml files. These files have mostly English tags and text but a few tags contain non-English text. I have no problem with opening the zip file, and parsing the xml files inside of it, but the non-English text looses it's encoding.

When an xml file is extracted and opened in Notepad++ the non-English text looks like:

Курс карбованца к доллару не изменился на Украинской Межбанковской Валютной Бирже (УМВБ) - 176.100.

When it is extracted and read in Python (on a linux box) the text looks like:

ÐÑÑÑ ÐºÐ°ÑÐ±Ð¾Ð²Ð°Ð½ÑÐ° Ðº Ð´Ð¾Ð»Ð»Ð°ÑÑ Ð½Ðµ Ð¸Ð·Ð¼ÐµÐ½Ð¸Ð»ÑÑ Ð½Ð° Ð£ÐºÑÐ°Ð¸Ð½ÑÐºÐ¾Ð¹ ÐÐµÐ¶Ð±Ð°Ð½ÐºÐ¾Ð²ÑÐºÐ¾Ð¹ ÐÐ°Ð»ÑÑÐ½Ð¾Ð¹ ÐÐ¸ÑÐ¶Ðµ (Ð£ÐÐÐ) - 176.100.

My code looks like:

def parse(self, fp):
    # open/decompress zip file
    with zipfile.ZipFile(fp, 'r') as f:
        # get all files in zip
        comp_files = f.namelist()
        for comp_file in comp_files:
            cfp = f.open(comp_file, 'r')
            # parse xml
            tree = ElementTree.parse(cfp)
            ...parsing...

I have tried decoding/encoding the text from cfp and wrapping it with codecs.EncodedFile() and input encoding of utf_8 and utf_8_sig with no change. What can I do to fix the non-English text?

Mark Tolonen · Accepted Answer

The result you are seeing is UTF-8 incorrectly decoded as latin-1/iso-8859-1:

>>> x=u'Курс карбованца к доллару не изменился на Украинской Межбанковской Валютной Бирже (УМВБ) - 176.100.'
>>> print x.encode('utf8').decode('latin1')
ÐÑÑÑ ÐºÐ°ÑÐ±Ð¾Ð²Ð°Ð½ÑÐ° Ðº Ð´Ð¾Ð»Ð»Ð°ÑÑ Ð½Ðµ Ð¸Ð·Ð¼ÐµÐ½Ð¸Ð»ÑÑ Ð½Ð° Ð£ÐºÑÐ°Ð¸Ð½ÑÐºÐ¾Ð¹ ÐÐµÐ¶Ð±Ð°Ð½ÐºÐ¾Ð²ÑÐºÐ¾Ð¹ ÐÐ°Ð»ÑÑÐ½Ð¾Ð¹ ÐÐ¸ÑÐ¶Ðµ (Ð£ÐÐÐ) - 176.100.

I saved the following text encoded via Notepad++ as as a single file encoded as UTF-8 without BOM in a zipfile:

Курс карбованца к доллару не изменился на Украинской Межбанковской Валютной Бирже (УМВБ) - 176.100.

Your code with modifications to make it runable:

from xml.etree import ElementTree
import zipfile

def parse(fp):
    # open/decompress zip file
    with zipfile.ZipFile(fp, 'r') as f:
        # get all files in zip
        comp_files = f.namelist()
        for comp_file in comp_files:
            cfp = f.open(comp_file, 'r')
            # parse xml
            tree = ElementTree.parse(cfp)
            print tree.getroot().text
            print type(tree.getroot().text)

parse(open('file.zip'))

The result:

Курс карбованца к доллару не изменился на Украинской Межбанковской Валютной Бирже (УМВБ) - 176.100.

So it looks to me that it is just being displayed incorrectly on your Linux box, but without an actual sample of the files you are working with, it is difficult to analyze further.

Issues reading zip containing UTF-8 xml files

Answers (1)

Related Questions