Reputation: 21
I am trying to load a UTF-8 file containing text in 14 different languages into Python (version 2.6.6). I am using the Python codecs
module to decode the text file.
import codecs
f = open('C:/temp/list_test.txt', 'r')
for lines in f:
    line = filter_str(lines.decode("utf-8"))
This all works well. I parse the entire file and then want to export 14 separate language files. The problem I can't understand is the following.
I use this code for output:
malangout = codecs.open("C:/temp/'polish.txt",'w','utf-8','surrogateescape')
for item in lang_dic['English']:
    temp = lang_dic[lang1][item]
    malangout.write(temp + '\n')
malangout.close()
Example:
The string is stored as is:
u'Dziennik zak\u201a\xf3ce\u0192'
I have tried many encodings from the Python docs (section 7.8, codecs). Any information would help at this point.
Upvotes: 2
Views: 7682
Reputation: 880707
The string is stored as is:
u'Dziennik zak\u201a\xf3ce\u0192'
Well, that's a problem since
In [25]: print(u'Dziennik zak\u201a\xf3ce\u0192')
Dziennik zak‚óceƒ
in contrast to
In [26]: print(u'Dziennik zak\u0142\xf3ce\u0144')
Dziennik zakłóceń
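To see exactly which characters differ between the two strings, one option is to name them with the standard unicodedata module. This is only a sketch: the variable names stored and expected are mine, and the "expected" string is my guess at the intended Polish text.

```python
import unicodedata

stored = u'Dziennik zak\u201a\xf3ce\u0192'    # what the question reports
expected = u'Dziennik zak\u0142\xf3ce\u0144'  # assumed intended text

# Pair up characters by position and keep only the ones that differ.
differences = [
    (unicodedata.name(s), unicodedata.name(e))
    for s, e in zip(stored, expected)
    if s != e
]

for stored_name, expected_name in differences:
    print('%s  ->  %s' % (stored_name, expected_name))
```

The stored string holds a low quotation mark and an f with hook where the Polish letters l with stroke and n with acute should be.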
So it looks like the unicode you are storing is incorrect. Are you sure it is correct in C:/temp/list_test.txt? That is, does list_test.txt
contain
In [28]: u'Dziennik zak\u201a\xf3ce\u0192'.encode('utf-8')
Out[28]: 'Dziennik zak\xe2\x80\x9a\xc3\xb3ce\xc6\x92'
or
In [27]: u'Dziennik zak\u0142\xf3ce\u0144'.encode('utf-8')
Out[27]: 'Dziennik zak\xc5\x82\xc3\xb3ce\xc5\x84'
?
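One way to answer that question is to read the file in binary mode and look at the undecoded bytes. The sketch below is an illustration, not your code: raw_lines is a hypothetical helper, and a temporary file stands in for list_test.txt so the example runs anywhere.

```python
import io
import os
import tempfile

def raw_lines(path):
    """Return the file's lines as undecoded bytes, without line endings."""
    with io.open(path, 'rb') as f:
        return [line.rstrip(b'\r\n') for line in f]

# Stand-in for list_test.txt: one line of correctly encoded Polish text.
fd, demo_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(u'Dziennik zak\u0142\xf3ce\u0144'.encode('utf-8') + b'\n')

first = raw_lines(demo_path)[0]
os.remove(demo_path)

print(repr(first))

# The correct UTF-8 encodings of U+0142 and U+0144 are C5 82 and C5 84.
contains_expected = b'\xc5\x82' in first and b'\xc5\x84' in first
```

Pointing raw_lines at the real file and checking for those byte pairs would show whether the damage is already in the input or happens later in the script.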
PS. You may want to change
temp + '\n'
to
temp + u'\n'
to make it clear you are adding two unicode objects together to form a unicode object.
The two lines above have the same result in Python 2, but in Python 3 adding text and bytes (the counterparts of Python 2's unicode and str) together raises a TypeError. Even though '\n' is itself unicode in Python 3, I think the challenge in transitioning to Python 3 will be in changing one's mental attitude toward mixing unicode and str. In Python 2 the conversion is silently attempted for you; in Python 3 it is disallowed.
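A small sketch of that difference, written so it runs on either version. The variable names are mine; b'\n' is a plain str in Python 2 and a bytes object in Python 3.

```python
import sys

text = u'Dziennik zak\u0142\xf3ce\u0144'

# unicode + unicode: explicit, and valid on both Python 2 and Python 3.
both_text = text + u'\n'

# unicode + byte string: silently converted on Python 2, rejected on Python 3.
try:
    mixed = text + b'\n'
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

print('Python %d allows mixing: %s' % (sys.version_info[0], mixing_allowed))
```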
Upvotes: 1