Reputation: 67
I have a string which my IDE (a very old Boa Constructor) automatically converts to escaped bytes. Now I want to convert it to unicode so that I can print it with the encoding of the specific machine (cp1252 on Windows or utf-8 on Linux).
I tried two different ways. One of them works and the other does not. Why?
Here is the working version:
#!/usr/bin/python
# vim: set fileencoding=cp1252 :
str = '\x80'
str = str.decode('cp1252') # to unicode
str = str.encode('cp1252') # to str
print str
Here is the version that does not work:
#!/usr/bin/python
# vim: set fileencoding=cp1252 :
str = u'\x80'
#str = str.decode('cp1252') # to unicode
str = str.encode('cp1252') # to str
print str
In version 1 I convert the str to unicode via the decode function. In version 2 I convert the str to unicode via the u prefix in front of the string literal. I thought the two versions would do exactly the same thing?
Upvotes: 0
Views: 9289
Reputation: 536399
'\x80'.decode('cp1252') doesn't give u'\u0080' (which is the same thing as u'\x80').
Byte 0x80 in Windows code page 1252 decodes to the Unicode character € (U+20AC, Euro sign).
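The difference can be checked directly. The following is a minimal sketch, written so that it should run on both Python 2.6+ and Python 3.3+ (the b prefix marks a byte string, which is what a plain '' literal already is in Python 2); the variable names are illustrative only.

```python
# Byte string containing the single byte 0x80.
raw = b'\x80'

# Decoding looks the byte up in the cp1252 table: 0x80 is the Euro sign.
decoded = raw.decode('cp1252')

# In a unicode literal, \x80 names the code point U+0080, not a byte.
literal = u'\x80'

# U+0080 has no representation in cp1252, so encoding the literal fails.
try:
    literal.encode('cp1252')
    literal_encodes = True
except UnicodeEncodeError:
    literal_encodes = False

print(repr(decoded), repr(literal), literal_encodes)
```

This is the same failure as in the second version from the question: the u prefix never consults cp1252, so the string holds a character that cp1252 cannot encode.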
There is an encoding where all of the bytes 0x00 to 0xFF decode to the Unicode characters with the same numbers U+0000 to U+00FF: it is iso-8859-1. With that encoding, your example works.
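As a quick check of that one-to-one mapping, here is a small sketch (names are illustrative; it should run on Python 2.6+ and 3.3+) that decodes every possible byte value with iso-8859-1 and compares the resulting code points with the byte values.

```python
# All 256 possible byte values as one byte string (works on Python 2 and 3).
all_bytes = bytes(bytearray(range(256)))

# iso-8859-1 maps byte N to the code point with the same number N.
text = all_bytes.decode('iso-8859-1')
code_points = [ord(ch) for ch in text]

# Because the mapping is one-to-one, u'\x80' encodes back to the byte 0x80.
round_trip = u'\x80'.encode('iso-8859-1')

print(code_points == list(range(256)), repr(round_trip))
```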
Windows cp1252 is similar to that encoding but not the same: whilst 0xA0 to 0xFF are the same as in iso-8859-1, and so you get the direct mapping behaviour for those characters, bytes 0x80 to 0x9F are an assortment of extra symbols from other Unicode blocks, instead of the invisible (and largely useless) control codes U+0080 to U+009F.
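The split between the two ranges can be verified with a short sketch like the one below. The function name is hypothetical. Note one assumption about Python's own cp1252 codec: it leaves a few bytes in the 0x80 to 0x9F range unassigned and raises UnicodeDecodeError for them, so the sketch collects those separately.

```python
def compare_cp1252_latin1():
    """Classify bytes 0x80..0xFF by how cp1252 relates to iso-8859-1."""
    same, different, undefined = [], [], []
    for value in range(0x80, 0x100):
        byte = bytes(bytearray([value]))  # one-byte string on Python 2 and 3
        try:
            as_cp1252 = byte.decode('cp1252')
        except UnicodeDecodeError:
            # Python's cp1252 codec has no character for this byte.
            undefined.append(value)
            continue
        if as_cp1252 == byte.decode('iso-8859-1'):
            same.append(value)
        else:
            different.append(value)
    return same, different, undefined


same, different, undefined = compare_cp1252_latin1()
print(len(same), len(different), len(undefined))
```

Every byte that ends up in the different or undefined list lies below 0xA0, which matches the description above.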
Upvotes: 1
Reputation: 29727
str.decode is not just prepending u to the string literal. It translates the bytes of the input string to meaningful characters (i.e. Unicode).
Then you call encode to convert these characters back to bytes, since you need to "print" them, i.e. output them to the terminal or any other OS entity (like a GUI window).
So, for your specific task, I believe you want something like:
s = '\x80'
print s.decode('cp1252').encode(platform_encoding)
where 'cp1252' is the encoding used by your IDE, and platform_encoding is a variable holding the encoding of the current system.
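One possible way to fill in the platform encoding is sketched below. The helper name to_platform_bytes and its parameters are hypothetical, the default source encoding is the cp1252 from the question, and using locale.getpreferredencoding() as the output encoding is an assumption that may not match every terminal.

```python
import locale


def to_platform_bytes(raw, source_encoding='cp1252', platform_encoding=None):
    """Decode raw bytes from source_encoding and re-encode them for output."""
    if platform_encoding is None:
        # Assumption: the locale's preferred encoding matches the terminal.
        platform_encoding = locale.getpreferredencoding()
    return raw.decode(source_encoding).encode(platform_encoding)


# Explicit targets, so the result does not depend on the current machine.
as_utf8 = to_platform_bytes(b'\x80', platform_encoding='utf-8')
as_cp1252 = to_platform_bytes(b'\x80', platform_encoding='cp1252')
print(repr(as_utf8), repr(as_cp1252))
```

If the platform encoding cannot represent a character (for example the Euro sign under a plain ASCII locale), the encode step raises UnicodeEncodeError, so real code may want to decide how to handle that case.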
In reply to your comment:
But the str.decode should have used the source code encoding (from line 2 in the file) to decode. So there should not be a difference to the u
This is an incorrect assumption. From Defining Python Source Code Encodings (PEP 263):
The encoding information is then used by the Python parser to interpret the file using the given encoding.
So set fileencoding=cp1252 just tells the interpreter how to convert the characters [you entered via the editor] to bytes when it parses the line str = '\x80'. This information is not used during str.decode calls.
You also ask what u'\x80' is. \x80 is simply interpreted as \u0080, and this is obviously not what you want. Take a look at this question: Bytes in a unicode Python string.
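If you are stuck with a unicode string such as u'\x80' whose code points are really cp1252 byte values, one common workaround is to turn the code points back into bytes with iso-8859-1 and then decode those bytes properly. This is only a sketch under that assumption, not necessarily what the linked question recommends, and the function name is hypothetical.

```python
def reinterpret_as_cp1252(text):
    """Treat code points U+0000..U+00FF as raw bytes and decode them as cp1252."""
    # iso-8859-1 maps each code point below U+0100 to the byte with that value.
    raw = text.encode('iso-8859-1')
    return raw.decode('cp1252')


fixed = reinterpret_as_cp1252(u'\x80')
print(repr(fixed))
```

The result is the Euro sign, which can then be encoded to cp1252 or utf-8 for output.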
Upvotes: 1