xiº

Reputation: 4687

Unicode with cp1251 and utf-8 on windows

I am playing around with Unicode in Python.

So there is a simple script:

# -*- coding: cp1251 -*-

print 'юникод'.decode('cp1251')
print unicode('юникод', 'cp1251')
print unicode('юникод', 'utf-8')

In cmd I've switched encoding to Active code page: 1251.

And here is the output:

СЋРЅРёРєРѕРґ
СЋРЅРёРєРѕРґ
юникод

I am a little bit confused.

Since I've declared the encoding as cp1251, I expected the strings to be decoded correctly.

Instead, garbage code points are printed. I understand that 'юникод' is just a sequence of bytes like: '\xd1\x8e\xd0\xbd\xd0\xb8\xd0\xba\xd0\xbe\xd0\xb4'.

Is there a way to get correct output in a terminal with cp1251? Should I build the byte string manually?

It seems I have misunderstood something.

Upvotes: 1

Views: 8236

Answers (3)

jfs

Reputation: 414905

Your issue is that the encoding declaration is wrong: your editor uses utf-8 character encoding to save the source code. Use # -*- coding: utf-8 -*- to fix it.

>>> u'юникод'
u'\u044e\u043d\u0438\u043a\u043e\u0434'
>>> u'юникод'.encode('utf-8')
'\xd1\x8e\xd0\xbd\xd0\xb8\xd0\xba\xd0\xbe\xd0\xb4'
>>> print _.decode('cp1251') # mojibake due to the wrong encoding
СЋРЅРёРєРѕРґ
>>> print u'юникод'
юникод
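One rough way to check which encoding the editor really used is to look at the raw bytes of the file. The following is a minimal Python 3 sketch, not part of the original answer: the helper name looks_like_utf8 is hypothetical, and the two byte strings stand in for what reading the source file in binary mode would return.

```python
def looks_like_utf8(raw):
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True

# Stand-ins for open(path, 'rb').read() on a file that contains the word.
saved_as_utf8 = 'юникод'.encode('utf-8')
saved_as_cp1251 = 'юникод'.encode('cp1251')

print(looks_like_utf8(saved_as_utf8))    # True
print(looks_like_utf8(saved_as_cp1251))  # False
```

A successful decode does not prove the file is UTF-8, but a failure does rule it out, which is usually enough to spot a wrong declaration.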

Do not use bytestrings ('' literals create a bytes object on Python 2) to represent text; use Unicode strings (u'' literals, the unicode type) instead. If your code uses Unicode strings, then the code page that your Windows console uses doesn't matter as long as the chosen font can display the corresponding (non-BMP) characters. See "Python, Unicode, and the Windows console".

Here's complete code, for reference:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
print(u'юникод')

Note: no .decode() or unicode() calls are needed. If you create a string from a literal and the string contains text, use a Unicode literal. That is the only option on Python 3, where you can't put non-ASCII characters inside a bytes literal, and using Unicode for text instead of bytestrings is good practice on Python 2 too.

If you are given a bytestring as input (not a literal) by some API, then its encoding has nothing to do with the encoding declaration. Which encoding to use depends on the source of the data.
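As an illustration of that last point, here is a small Python 3 sketch. The function read_payload and the constant PAYLOAD_ENCODING are made up for the example; the assumption is that the data source documents its encoding as cp1251.

```python
def read_payload():
    """Hypothetical API that hands back raw bytes (cp1251 in this example)."""
    return b'\xfe\xed\xe8\xea\xee\xe4'

# Assumption: this comes from the data source's documentation,
# not from the coding line at the top of the script.
PAYLOAD_ENCODING = 'cp1251'

# Decode once at the boundary, then work with Unicode text only.
text = read_payload().decode(PAYLOAD_ENCODING)
print(text == 'юникод')  # True
```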

Upvotes: 1

Serge Ballesta

Reputation: 149195

I think I understand what happened. The last line gave me the hint, and your garbage code points confirmed it: you are trying to display cp1251 characters, but your editor is configured to use UTF-8.

The # -*- coding: cp1251 -*- line is only used by the Python interpreter to convert characters outside the ASCII range in Python source files. And in any case it only affects unicode literals, because the bytes of a byte-string literal are passed through exactly as they appear in the source. Some text editors are kind enough to honor this line automatically (the IDLE editor does), but I have little confidence in that and always switch manually to the proper encoding when I use gvim, for example. In short: # -*- coding: cp1251 -*- has no effect in your code and can only mislead a reader, since it is not the actual encoding.
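The effect of a wrong declaration can be reproduced without an editor. This is a Python 3 sketch, so every '' literal is Unicode and behaves like a u'' literal in Python 2; it relies on compile() honoring the coding line when it is given bytes. The helper run_source is hypothetical.

```python
# What a UTF-8 editor writes to disk for the word.
word_utf8 = 'юникод'.encode('utf-8')

def run_source(declared):
    """Run a tiny module whose bytes are UTF-8 but whose coding line says `declared`."""
    source = (b"# -*- coding: " + declared.encode('ascii') + b" -*-\n"
              b"s = '" + word_utf8 + b"'\n")
    namespace = {}
    exec(compile(source, '<sketch>', 'exec'), namespace)
    return namespace['s']

# The garbled text from the question: UTF-8 bytes read as cp1251.
mojibake = word_utf8.decode('cp1251')

print(run_source('utf-8') == 'юникод')   # True: declaration matches the bytes
print(run_source('cp1251') == mojibake)  # True: wrong declaration garbles the literal
```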

If you want to be sure of what is in your source, you had better use explicit escapes. In code page 1251, the word юникод consists of these bytes: '\xfe\xed\xe8\xea\xee\xe4'

If you write this source:

txt = '\xfe\xed\xe8\xea\xee\xe4'
print txt
print txt.decode('cp1251')
print unicode(txt, 'cp1251')
print unicode(txt, 'utf-8')

and execute it in a console configured to use the CP1251 charset, the first three print statements will output юникод, and the last one will raise a UnicodeDecodeError because the input is not valid UTF-8.
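For readers on Python 3, roughly the same check can be sketched with a bytes literal. This is an adaptation, not the code from the answer; it prints booleans rather than the text itself so the result does not depend on the console.

```python
raw = b'\xfe\xed\xe8\xea\xee\xe4'  # the word encoded in cp1251

# Decoding with the right codec gives the expected text.
print(raw.decode('cp1251') == 'юникод')  # True

# Decoding with the wrong codec fails: 0xfe can never appear in UTF-8.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)  # False
```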

Alternatively, if you are comfortable with your current editor, you could write:

# -*- coding: utf8 -*-

txt = 'юникод'.decode('utf8').encode('cp1251') # or simply txt = u'юникод'.encode('cp1251')
print txt
print txt.decode('cp1251')
print unicode(txt, 'cp1251')
print unicode(txt, 'utf-8')

which should give the same results, but now the declared source encoding matches the actual encoding of your Python source.


BTW, a Python 3.5 IDLE session, which natively uses Unicode, confirmed that:

>>> 'СЋРЅРёРєРѕРґ'.encode('cp1251').decode('utf8')
'юникод'
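The same round trip can be wrapped in a small function. This is a Python 3 sketch with a hypothetical helper name, repair_mojibake; it only works when every byte survived the wrong decoding, which happens to be the case for this word.

```python
def repair_mojibake(garbled, wrong='cp1251', right='utf-8'):
    """Undo text that was encoded as `right` but decoded as `wrong` (hypothetical helper)."""
    return garbled.encode(wrong).decode(right)

# Reproduce the garbled output from the question, then undo it.
garbled = 'юникод'.encode('utf-8').decode('cp1251')
print(repair_mojibake(garbled) == 'юникод')  # True
```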

Upvotes: 6

Mark Tolonen

Reputation: 178409

Just use the following, but ensure you save the source code in the declared encoding. It can be any encoding that supports the characters you want to print. The terminal can be in a different encoding, as long as it also supports the characters you want to print:

#coding:utf8
print u'юникод'

The advantage is that you don't need to know the terminal's encoding. Python will normally [1] detect the terminal encoding and encode the print output correctly.

[1] Unless your terminal is misconfigured.
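To see what "supports the characters" means in practice, one can test whether a string is encodable in a given terminal encoding. This is a Python 3 sketch; can_print is a hypothetical helper, and the fallback to UTF-8 when sys.stdout.encoding is missing is an assumption made for the example.

```python
import sys

def can_print(text, encoding=None):
    """Return True if `text` can be encoded for the terminal (hypothetical helper)."""
    # sys.stdout.encoding may be None when output is redirected;
    # falling back to UTF-8 is an assumption of this sketch.
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'utf-8'
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

print(can_print('юникод', 'cp1251'))  # True: cp1251 covers Cyrillic
print(can_print('юникод', 'cp437'))   # False: cp437 has no Cyrillic letters
```

Calling can_print(text) without an encoding checks against whatever Python detected for the current terminal.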

Upvotes: 0
