Reputation: 115
I have the text "confrères" in a text file encoded as "ISO-8859-2". I want to convert this value to "UTF-8" in Python.
I used the following code in Python (2.7) to convert it, but the converted value ["confrčres"] differs from the original value ["confrères"].
# -*- coding: utf-8 -*-
import chardet
import codecs
a1=codecs.open('.../test.txt', 'r')
a=a1.read()
b = a.decode(chardet.detect(a)['encoding']).encode('utf8')
a1=codecs.open('.../test_out.txt', 'w').write(b)
Any idea how to get the actual value, but UTF-8 encoded, in the output file?
Thanks
Upvotes: 1
Views: 887
Reputation: 1125368
If you know the codec used, don't use chardet. Character detection is never foolproof; the library guessed wrong for your file.
Note that ISO-8859-2 is the wrong codec, as that codec cannot even encode the letter è. You have ISO-8859-1 (Latin-1) or Windows codepage 1252 data instead; è in 8859-1 and cp1252 is encoded to 0xE8, and 0xE8 in 8859-2 is č:
>>> print u'confrčres'.encode('iso-8859-2').decode('iso-8859-1')
confrères
Was 8859-2 perhaps the guess chardet made?
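To make the byte-level point concrete, here is a small sketch that decodes the single byte 0xE8 with each of the three codecs mentioned above. The names raw and decoded are only illustrative; the snippet should run on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

# The one byte that the file stores for the accented letter.
raw = b'\xe8'

# Decode the same byte with each candidate codec.
decoded = {}
for codec in ('iso-8859-1', 'cp1252', 'iso-8859-2'):
    decoded[codec] = raw.decode(codec)
    print(codec, repr(decoded[codec]))

# ISO-8859-2 has no code point for U+00E8 (e with grave), so encoding fails.
try:
    '\u00e8'.encode('iso-8859-2')
    can_encode = True
except UnicodeEncodeError:
    can_encode = False
print('encodable in iso-8859-2:', can_encode)
```

The same byte yields è under Latin-1 and cp1252 but č under ISO-8859-2, which matches the garbled output in the question.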
You can use the io library to handle decoding and encoding on the fly; it is the same codebase that handles all I/O in Python 3 and has fewer issues than codecs:
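from shutil import copyfileobj
# On Python 2.7 the built-in open() has no encoding argument; use io.open
from io import open

# Decode the input as Latin-1 and re-encode as UTF-8 while copying
with open('test.txt', 'r', encoding='iso-8859-1') as inf:
    with open('test_out.txt', 'w', encoding='utf8') as outf:
        copyfileobj(inf, outf)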
I used shutil.copyfileobj() to handle the copying across of data.
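As a rough end-to-end check of this approach, the sketch below wraps the copy in a helper and runs it against a temporary file. The function name transcode, the temporary paths, and the sample bytes are assumptions made for illustration, not part of the question's setup.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

import io
import os
import tempfile
from shutil import copyfileobj


def transcode(src_path, dst_path, src_encoding, dst_encoding='utf-8'):
    """Copy a text file, decoding with src_encoding and encoding with dst_encoding.

    Hypothetical helper; newline='' keeps the original line endings untouched.
    """
    with io.open(src_path, 'r', encoding=src_encoding, newline='') as inf:
        with io.open(dst_path, 'w', encoding=dst_encoding, newline='') as outf:
            copyfileobj(inf, outf)


# Build a sample input file holding the Latin-1 bytes for the word in question.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, 'test.txt')
dst = os.path.join(tmp_dir, 'test_out.txt')
with io.open(src, 'wb') as f:
    f.write(b'confr\xe8res')

transcode(src, dst, 'iso-8859-1')

# Read the result back as raw bytes to inspect the UTF-8 encoding.
with io.open(dst, 'rb') as f:
    result = f.read()
print(repr(result))
```

Because copyfileobj() moves the data in chunks, the whole file never has to fit in memory, which matters more for large inputs than for this one-word sample.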
Upvotes: 5