UnicodeEncodeError: 'charmap' codec can't encode character '\x80' in position 0: character maps to <undefined>

Question

I have a string that my IDE (a very old Boa Constructor) automatically converts to a byte string. Now I want to convert it to unicode so that I can print it in the encoding of the specific machine (cp1252 on Windows or utf-8 on Linux).

I tried two different approaches. One of them works and the other does not, but why?

Here is the working version:

#!/usr/bin/python
# vim: set fileencoding=cp1252 :

str = '\x80'
str = str.decode('cp1252') # to unicode
str = str.encode('cp1252') # to str
print str

Here is the version that does not work:

#!/usr/bin/python
# vim: set fileencoding=cp1252 :

str = u'\x80'
#str = str.decode('cp1252') # to unicode
str = str.encode('cp1252') # to str
print str

In version 1 I convert the str to unicode via the decode function. In version 2 I convert the str to unicode via the u prefix in front of the string literal. I thought the two versions would do exactly the same thing?

Roman Bodnarchuk · Accepted Answer

str.decode does more than prepend u to the string literal: it translates the bytes of the input string into meaningful characters (i.e. Unicode code points), using the codec you name.

You then call encode to convert these characters back to bytes, because printing means handing bytes to the terminal or to any other OS entity (such as a GUI window).
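As a minimal sketch of that decode/encode round trip, the snippet below uses Python 3 syntax, which is an assumption on my part since the question uses Python 2: here b'\x80' plays the role of the Python 2 str '\x80', and a plain '...' literal corresponds to u'...'.

```python
# Python 3 syntax (assumption): b'...' is the byte string, '...' is unicode.
raw = b'\x80'                        # one byte, no meaning until decoded

text = raw.decode('cp1252')          # byte 0x80 -> U+20AC EURO SIGN
round_trip = text.encode('cp1252')   # U+20AC -> byte 0x80 again

print('U+%04X' % ord(text))          # code point produced by decode
print(round_trip == raw)             # encode reverses the decode
```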

So, about your specific task, I believe you want something like:

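import locale
platform_encoding = locale.getpreferredencoding()  # one possible way to get it
s = '\x80'
print s.decode('cp1252').encode(platform_encoding)

where 'cp1252' is the encoding used by your IDE, and platform_encoding is a variable holding the encoding of the current system.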
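The same idea can be wrapped in a small helper. This is only a sketch in Python 3 syntax: the name reencode and its parameters are hypothetical, and falling back to locale.getpreferredencoding(False) is an assumption about what the terminal expects, not something the question specifies.

```python
import locale


def reencode(raw, source_encoding='cp1252', target_encoding=None):
    """Decode raw bytes from source_encoding and re-encode them for the target."""
    if target_encoding is None:
        # Assumption: the locale's preferred encoding matches the terminal.
        target_encoding = locale.getpreferredencoding(False)
    return raw.decode(source_encoding).encode(target_encoding)


# The euro sign stored as cp1252 byte 0x80, re-encoded for two platforms.
print(reencode(b'\x80', target_encoding='utf-8'))    # b'\xe2\x82\xac'
print(reencode(b'\x80', target_encoding='cp1252'))   # b'\x80'
```

Passing target_encoding explicitly keeps the result predictable; relying on the locale default can still raise UnicodeEncodeError on a system whose encoding has no euro sign.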

In reply to your comment:

But the str.decode should have used the source code encoding (from line 2 in the file) to decode. So there should not be a difference to the u

This is an incorrect assumption. From PEP 263, Defining Python Source Code Encodings:

The encoding information is then used by the Python parser to interpret the file using the given encoding.

So set fileencoding=cp1252 only tells the interpreter how to read the characters you typed in the editor when it parses the source file, including the line str = '\x80'. This information is not used during str.decode calls; decode uses only the codec you pass to it (here 'cp1252').
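To illustrate that the result of decode depends on the codec argument and not on anything declared at the top of the file, here is a small sketch (Python 3 syntax, an assumption; the codec list is just an illustrative choice) that decodes the same byte three ways:

```python
raw = b'\x80'

# Same byte, three codecs, three different characters.
decoded = {name: raw.decode(name) for name in ('cp1252', 'cp1251', 'latin-1')}

for name, ch in decoded.items():
    print(name, 'U+%04X' % ord(ch))
```

Only the 'latin-1' result is the code point U+0080, which is the character a u'\x80' literal denotes.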

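You also ask what u'\x80' is. In a unicode literal, \x80 is simply interpreted as \u0080, i.e. the code point U+0080, a control character rather than the euro sign (U+20AC) that byte 0x80 stands for in cp1252. cp1252 has no byte for U+0080, so encode raises the UnicodeEncodeError, and this is obviously not what you want. Take a look at this question: Bytes in a unicode Python string.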
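The failure can be reproduced in a few lines. This sketch again assumes Python 3 syntax, where the str literal '\x80' is the equivalent of Python 2's u'\x80'; the variable names are illustrative only.

```python
wrong = '\x80'     # code point U+0080, a control character
right = '\u20ac'   # EURO SIGN, what cp1252 byte 0x80 actually means

try:
    wrong.encode('cp1252')
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print(exc)     # the 'charmap' error from the question title

print(right.encode('cp1252'))   # b'\x80'
```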
