Reputation: 61
This is flabbergasting and extremely frustrating, please help.
>>> a1 = '\xe5' # type <str>
>>> a2 = u'\xe5' # type <unicode>
>>> ord(a1)
229
>>> ord(a2)
229
>>> print a2.encode('utf-8')
å
>>> print a1.encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
If a1 and a2 have the same value, why can't both be encoded?
I have to use an external API that returns unicode data in the a1 form, which makes it useless. Python apparently insists that <str>-typed strings contain only ASCII characters, or it refuses to encode them. This completely breaks my application.
Upvotes: 3
Views: 4542
Reputation: 61
Ignacio's suggestion to decode the byte string from its actual encoding (not ascii, but which one?) got me to try latin-1, even though I thought it should have been utf-8. That worked!
I get the data from the Python 2.7 curses module, so my best guess is that the problem lies somewhere in there. The terminal's encoding is utf-8, but it works now.
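In case it helps anyone else, here is a minimal sketch of the round trip that ended up working for me (Python 2, UTF-8 terminal; it assumes the bytes really are latin-1, which is just my working hypothesis):
>>> a1 = '\xe5'                                   # byte string as returned by the API
>>> a1.decode('latin-1')                          # bytes -> unicode, using the actual encoding
u'\xe5'
>>> print a1.decode('latin-1').encode('utf-8')    # unicode -> UTF-8 bytes for the terminal
å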
Upvotes: 0
Reputation: 4643
Let me take your confusion apart piece by piece. Let's start with the distinction between str and unicode. In Python 2.X:
str is a string of 8-bit (1-byte) characters that prints as ASCII whenever possible. str is really a sequence of bytes and is the equivalent of bytes in Python 3.X. There is no encoding attached to a str.
unicode is a string of Unicode code points.
Second, according to the Python documentation, encoding means:
"The rules for translating a Unicode string into a sequence of bytes are called an encoding."
Then ask yourself this question: does it make sense to encode a str, which is already a sequence of bytes? The answer is no, because a str is already a sequence of bytes. It does make sense, however, to encode a unicode string. Why? Because it is a string of Unicode code points (e.g., U+00E4).
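A short illustrative session (Python 2, assuming a UTF-8 setting) showing the two directions, encode for unicode and decode for str:
>>> u = u'\xe5'                     # unicode: code point U+00E5
>>> u.encode('utf-8')               # unicode -> bytes (str)
'\xc3\xa5'
>>> '\xc3\xa5'.decode('utf-8')      # bytes (str) -> unicode
u'\xe5'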
Upvotes: 0
Reputation: 6979
In python2, plain str strings are byte strings and the default encoding is ASCII, while in python3 strings are Unicode.
ASCII characters may only have a value between 0 and 127 inclusive. Unicode characters, however, may have much higher values.
python2:
>>> a = '\x7f'
>>> a.encode('utf-8')
'\x7f'
>>> a = '\x80'
>>> a.encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)
python3:
>>> a = '\x7f'
>>> a.encode('utf-8')
b'\x7f'
>>> a = '\x80'
>>> a.encode('utf-8')
b'\xc2\x80'
The reason this works in python2 with the u prefix is that you are explicitly stating "this is a Unicode string".
It might be worth reading up on Unicode in python2 for a more in-depth understanding.
To make use of the (broken) API, it would be best to convert the returned string into a bytearray, but note that this will not work in python3.
>>> a = '\xe5'
>>> b = bytearray(a)
>>> b[0]
229
Remember that \xe5 by itself is not valid UTF-8. To store the value 0xE5 in a UTF-8 encoded string, you'd need to store two bytes: 0xC3 0xA5.
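For completeness, a small sketch of that relationship (Python 2; it assumes the stray byte is meant to be U+00E5, i.e. latin-1 encoded):
>>> a = '\xe5'
>>> a.decode('latin-1')                   # reinterpret the byte as the code point U+00E5
u'\xe5'
>>> a.decode('latin-1').encode('utf-8')   # the same character as two UTF-8 bytes
'\xc3\xa5'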
Upvotes: 1
Reputation: 599778
You can only encode Unicode strings. If you call encode on a bytestring, Python first tries to decode it using the default (ASCII) encoding, hence the error. (Note that this confusing behaviour only occurs in Python 2; it has been removed in Python 3.)
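A quick Python 2 sketch of that implicit decode (illustrative only; a1 is the byte string from the question):
>>> a1 = '\xe5'
>>> a1.decode('ascii')          # the hidden step behind a1.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
>>> u'\xe5'.encode('utf-8')     # an actual unicode string encodes without complaint
'\xc3\xa5'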
Upvotes: 3