Klas Lindberg

Reputation: 61

Why can u'\xe5' be decoded but not '\xe5'?

This is flabbergasting and extremely frustrating, please help.

>>> a1 = '\xe5'   # type <str>
>>> a2 = u'\xe5'  # type <unicode>
>>> ord(a1)
229
>>> ord(a2)
229
>>> print a2.encode('utf-8')
å
>>> print a1.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)

If a1 and a2 have the same value, why can't both be encoded?

I have to use an external API that returns Unicode data in the a1 form, which makes it useless. Python apparently insists that <str> strings contain only ASCII characters, or it refuses to encode them. It completely breaks my application.

Upvotes: 3

Views: 4542

Answers (4)

Klas Lindberg

Reputation: 61

Ignacio's suggestion to decode the byte string from its actual encoding (not ascii, but what?) got me to try with latin-1 even though I think it should have been utf-8. That worked!

I get the data from the Python2.7 curses module. My best guess is the problem is in there somewhere. The terminal's encoding is utf-8, but ok it works now.
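For anyone hitting the same thing, here is a minimal sketch of the fix described above. It assumes the API really hands back Latin-1 bytes; `raw` is only a stand-in for the returned value, and the literals are written so the snippet runs on both Python 2.7 and Python 3.

```python
raw = b'\xe5'                    # stand-in for the byte string the API returns

# UTF-8 cannot decode this: 0xE5 opens a three-byte sequence that never completes.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

text = raw.decode('latin-1')     # bytes -> Unicode text (byte 0xE5 maps to U+00E5)
encoded = text.encode('utf-8')   # Unicode text -> UTF-8 bytes, safe to write to a UTF-8 terminal
```

That would explain why guessing utf-8 failed while latin-1 worked: the byte value matches the code point only under Latin-1.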

Upvotes: 0

GIZ

Reputation: 4643

Let me break your confusion down piece by piece. Let's start with the distinction between str and unicode. In Python 2.X:

  1. str is a string of 8-bit characters (1 byte each) that prints as ASCII whenever possible. str is really a sequence of bytes and is the equivalent of bytes in Python 3.X. A str carries no encoding information of its own.
  2. unicode is a string of Unicode code-points.

Second, here is how the Python documentation defines an encoding:

"The rules for translating a Unicode string into a sequence of bytes are called an encoding."

Then ask yourself this question: does it make sense to encode a str, which is already a sequence of bytes? The answer is no, because there is nothing left to translate into bytes. It does, however, make sense to encode unicode. Why? Because it is a string of Unicode code points (e.g., U+00E4) that still has to be turned into bytes.
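To make the direction of each operation concrete, here is a small sketch (the variable names are just for illustration). It runs on Python 2.7 and Python 3, since `u''` and `b''` literals are accepted by both.

```python
text = u'\xe5'                       # Unicode text: a single code point
data = text.encode('utf-8')          # encode: text -> bytes
round_trip = data.decode('utf-8')    # decode: bytes -> text
```

Encoding always starts from text and decoding always starts from bytes, so the round trip gives back the original string.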

Upvotes: 0

Attie

Reputation: 6979

In Python 2, a plain str is a byte string that Python implicitly treats as ASCII whenever it has to convert it, while in Python 3 a str is Unicode.

ASCII characters may only have a value between 0 and 127 inclusive. Unicode characters however may have a much higher value.

python2:

>>> a = '\x7f'
>>> a.encode('utf-8')
'\x7f'
>>> a = '\x80'
>>> a.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)

python3:

>>> a = '\x7f'
>>> a.encode('utf-8')
b'\x7f'
>>> a = '\x80'
>>> a.encode('utf-8')
b'\xc2\x80'

This works in Python 2 with the u prefix because you are explicitly stating that "this is a Unicode string".


It might be worth reading up for a more in-depth understanding of using Unicode in python2:


To make use of the (broken) API, it would be best to convert the returned string into a bytearray, but note, this will not work in python3.

>>> a = '\xe5'
>>> b = bytearray(a)
>>> b[0]
229

Remember that the single byte \xe5 is not valid UTF-8 on its own. To store the code point U+00E5 in a UTF-8 encoded string, you need two bytes: 0xC3 0xA5.
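A quick way to check those byte values yourself is sketched below. The names are made up for the example, and the `b''` literal is used so that the bytearray call also works on Python 3, where passing a plain string without an encoding raises a TypeError.

```python
char = u'\xe5'                                  # the character in question
as_utf8 = bytearray(char.encode('utf-8'))       # two bytes under UTF-8
as_latin1 = bytearray(char.encode('latin-1'))   # one byte under Latin-1
api_bytes = bytearray(b'\xe5')                  # what the API appears to return
```

Here `list(as_utf8)` is `[195, 165]` (0xC3 0xA5), while `list(as_latin1)` and `list(api_bytes)` are both `[229]`, which suggests the API data is Latin-1 rather than UTF-8.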

Upvotes: 1

Daniel Roseman

Reputation: 599778

You can only encode Unicode strings. If you call encode on a bytestring, Python tries to decode it first, using the default encoding, which is ASCII; hence the UnicodeDecodeError. (Note that this confusing behaviour only occurs in Python 2; it has been removed in Python 3.)
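Roughly speaking, Python 2 turns `a1.encode('utf-8')` into the two explicit steps below. This is a sketch of the behaviour, not the interpreter's actual code, and the 'ascii' codec is hard-coded here on the assumption that the default encoding has not been changed.

```python
raw = b'\xe5'

try:
    # Step 1 (implicit in Python 2): decode with the default 'ascii' codec.
    # Step 2: encode the result as UTF-8. Step 1 fails, so step 2 never runs.
    raw.decode('ascii').encode('utf-8')
    failed = False
    message = ''
except UnicodeDecodeError as exc:
    failed = True
    message = str(exc)
```

The message captured here is the same one shown in the question's traceback, which is why an encode call ends up reporting a decode error.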

Upvotes: 3
