user1163567

Reputation: 21

Python codecs module

I am trying to load a UTF-8 file containing text in 14 different languages into Python (version 2.6.6). I am using the Python codecs module to decode the text file.

import codecs
f = open('C:/temp/list_test.txt', 'r')
for lines in f:
    # decode each raw byte line from UTF-8 into a unicode object
    line = filter_str(lines.decode("utf-8"))

This all works well. I parse the entire file and then want to export 14 different language files. The problem I can't understand is the following.

I use the following code for output:

malangout = codecs.open("C:/temp/polish.txt", 'w', 'utf-8', 'surrogateescape')
for item in lang_dic['English']:
    temp = lang_dic[lang1][item]
    malangout.write(temp + '\n')
malangout.close()

Example:

The string is stored as is:

u'Dziennik zak\u201a\xf3ce\u0192'

I have tried many encodings from the Python docs (7.8 codecs). Any information would help at this point.

Upvotes: 2

Views: 7682

Answers (1)

unutbu

Reputation: 880707

The string is stored as is:

u'Dziennik zak\u201a\xf3ce\u0192'

Well, that's a problem since

In [25]: print(u'Dziennik zak\u201a\xf3ce\u0192')
Dziennik zak‚óceƒ

in contrast to

In [26]: print(u'Dziennik zak\u0142\xf3ce\u0144')
Dziennik zakłóceń

So it looks like the unicode you are storing is incorrect. Are you sure it is correct in C:/temp/list_test.txt? That is, does list_test.txt contain

In [28]: u'Dziennik zak\u201a\xf3ce\u0192'.encode('utf-8')
Out[28]: 'Dziennik zak\xe2\x80\x9a\xc3\xb3ce\xc6\x92'

or

In [27]: u'Dziennik zak\u0142\xf3ce\u0144'.encode('utf-8')
Out[27]: 'Dziennik zak\xc5\x82\xc3\xb3ce\xc5\x84'

?
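One way to answer that question is to inspect the raw bytes of the file without decoding them. The following is only a sketch: the helper names `classify_line` and `check_file` are made up for illustration, and a throwaway temporary file stands in for list_test.txt.

```python
import io
import os
import tempfile

# The two candidate byte sequences shown above.
GOOD = u'Dziennik zak\u0142\xf3ce\u0144'.encode('utf-8')
BAD = u'Dziennik zak\u201a\xf3ce\u0192'.encode('utf-8')


def classify_line(raw):
    """Report which of the two byte sequences a raw (undecoded) line contains."""
    if GOOD in raw:
        return 'correct'
    if BAD in raw:
        return 'already garbled'
    return 'neither'


def check_file(path):
    # Binary mode: no decoding happens, so we see exactly what is on disk.
    with io.open(path, 'rb') as f:
        return [classify_line(raw) for raw in f]


# Demo with a temporary file that contains one line of each kind.
fd, demo_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(demo_path, 'wb') as f:
    f.write(GOOD + b'\n')
    f.write(BAD + b'\n')

results = check_file(demo_path)
os.remove(demo_path)
print(results)  # ['correct', 'already garbled']
```

If the real file reports 'already garbled', the damage happened before Python ever read it, and no choice of output codec will repair it.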


PS. You may want to change

temp + '\n'

to

temp + u'\n'

to make it clear that you are concatenating two unicode objects to form a unicode object. Both lines give the same result in Python 2, but Python 3 raises a TypeError when you add text and bytes together. Even though '\n' is itself unicode text in Python 3, I think the real challenge in moving to Python 3 is changing one's mental attitude toward mixing unicode and str. Python 2 silently attempts the conversion for you; Python 3 disallows it.
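The difference can be seen in a small sketch. The helper `try_concat` is a made-up name for illustration. Note that this example mixes already-encoded bytes with unicode, which is a harsher case than temp + '\n': in Python 3 the mix raises TypeError, while in Python 2 the implicit ASCII decode is attempted and fails on these non-ASCII bytes with UnicodeDecodeError.

```python
def try_concat(left, right):
    """Return left + right, or the name of the exception the addition raised."""
    try:
        return left + right
    except (TypeError, UnicodeDecodeError) as exc:
        return type(exc).__name__


text = u'Dziennik zak\u0142\xf3ce\u0144'
encoded = text.encode('utf-8')  # bytes in Python 3, str in Python 2

# unicode + unicode works the same way in both versions.
same_type = try_concat(text, u'\n')

# encoded bytes + unicode: TypeError in Python 3,
# UnicodeDecodeError from the implicit ASCII decode in Python 2.
mixed = try_concat(encoded, u'\n')

print(repr(same_type))
print(mixed)
```

Keeping every string unicode until the moment it is written to the file avoids both failure modes.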

Upvotes: 1
