annamalai muthuraman

Reputation: 115

Encoding Decoding Python

I have the text "confrères" in a text file encoded as "ISO-8859-2". I want to convert this value to "UTF-8" in Python.

I used the following code in Python (2.7) to convert it, but the converted value ["confrčres"] differs from the original value ["confrères"].

# -*- coding: utf-8 -*-

import chardet
import codecs

a1=codecs.open('.../test.txt', 'r')

a=a1.read()

b = a.decode(chardet.detect(a)['encoding']).encode('utf8')

a1=codecs.open('.../test_out.txt', 'w').write(b)

Any idea how to get the actual value, but UTF-8 encoded, in the output file?

Thanks

Upvotes: 1

Views: 887

Answers (1)

Martijn Pieters

Reputation: 1125368

If you know the codec used, don't use chardet. Character detection is never foolproof, and the library guessed wrong for your file.

Note that ISO-8859-2 is the wrong codec, as that codec cannot even encode the letter è. You have ISO-8859-1 (Latin-1) or Windows codepage 1252 data instead; è in 8859-1 and cp1252 is encoded to 0xE8, and 0xE8 in 8859-2 is č:

>>> print u'confrčres'.encode('iso-8859-2').decode('iso-8859-1')
confrères

Was 8859-2 perhaps the guess chardet made?
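To make the byte-level argument concrete, here is a minimal sketch that decodes the same raw bytes with both codecs. The `raw` bytes and the `can_encode` helper are illustrative assumptions, not part of your code; the snippet is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

# Assumed file content: "confr", the single byte 0xE8, then "res".
raw = b'confr\xe8res'

# The same byte decodes to different characters depending on the codec.
as_latin1 = raw.decode('iso-8859-1')   # 0xE8 -> U+00E8 (e with grave)
as_latin2 = raw.decode('iso-8859-2')   # 0xE8 -> U+010D (c with caron)


def can_encode(text, codec):
    """Hypothetical helper: True if every character of text exists in codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# U+00E8 exists in Latin-1 and cp1252, but not in ISO-8859-2.
print(can_encode('\xe8', 'iso-8859-1'))
print(can_encode('\xe8', 'cp1252'))
print(can_encode('\xe8', 'iso-8859-2'))
```

If the last check reports that the character cannot be encoded, the file cannot really have been ISO-8859-2 to begin with.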

You can use the io library to handle decoding and encoding on the fly; it is the same codebase that handles all I/O in Python 3 and has fewer issues than codecs:

from io import open  # Python 2.7: the built-in open() has no encoding argument
from shutil import copyfileobj

with open('test.txt', 'r', encoding='iso-8859-1') as inf:
    with open('test_out.txt', 'w', encoding='utf8') as outf:
        copyfileobj(inf, outf)

I used shutil.copyfileobj() to handle the copying across of data.
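If you want to check the result end to end, the following sketch wraps the same idea in a function and verifies the output bytes. The `transcode` name, the temporary directory, and the sample bytes are assumptions made for illustration; `io.open` is used explicitly so the snippet behaves the same on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import shutil
import tempfile


def transcode(src_path, dst_path, src_encoding, dst_encoding='utf-8'):
    """Copy src_path to dst_path, re-encoding the text on the fly.

    newline='' turns off newline translation so line endings are kept as-is.
    """
    with io.open(src_path, 'r', encoding=src_encoding, newline='') as inf:
        with io.open(dst_path, 'w', encoding=dst_encoding, newline='') as outf:
            shutil.copyfileobj(inf, outf)


# Build a throwaway Latin-1 input file (assumed sample data).
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'test.txt')
dst = os.path.join(workdir, 'test_out.txt')

with io.open(src, 'wb') as f:
    f.write(b'confr\xe8res\n')        # 0xE8 is the Latin-1 byte for e-grave

transcode(src, dst, 'iso-8859-1')

with io.open(dst, 'rb') as f:
    result = f.read()                 # raw UTF-8 bytes of the output file

shutil.rmtree(workdir)                # clean up the temporary files
```

In UTF-8 the character takes two bytes (0xC3 0xA8), so the output file should be one byte longer than the input.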

Upvotes: 5
