user5390281

Encoding unicode with 'utf-8' shows byte-strings only for non-ascii

I'm running python2.7.10

I'm trying to wrap my head around why the following behavior is seen. (I'm sure there is a reasonable explanation.)

So I define two unicode characters, with only the first one in the ascii-set, and the second one outside of it.

>>> a=u'\u0041'
>>> b=u'\u1234'
>>> print a
A
>>> print b
ሴ

Now I encode it to see what the corresponding bytes would be. But only the latter gives me the results I am expecting to see (bytes)

>>> a.encode('utf-8')
'A'
>>> b.encode('utf-8')
'\xe1\x88\xb4'

Perhaps the issue is in my expectation, and if so, one of you can explain where the flaw lies.

- My a,b are unicodes (hex values of the ordinals inside)
- When I print these, the interpreter prints the actual character corresponding to each unicode byte.
- When I encode, I assumed that it would be converted into a byte-string using the encoding scheme I provide (in this case utf-8). I expected to see a bytestring for a.encode, just like I did for b.encode.

What am I missing?

Upvotes: 1

Views: 849

Answers (1)

Martijn Pieters

Reputation: 1121216

There is no flaw. You encoded to UTF-8, which uses the same single bytes as the ASCII standard for the first 128 code points of the Unicode standard (U+0000 through U+007F), and uses multiple bytes (between 2 and 4) for everything else.
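As a rough illustration of those byte counts, the snippet below encodes a few sample code points and records how many bytes each one needs. The sample characters and variable names are chosen here for illustration and are not from the question; the u'' prefix keeps the snippet valid on both Python 2.7 and Python 3.3+.

```python
# Sample code points (illustrative choices, not from the question):
# U+0041 and U+007F are in the ASCII range; the others are not.
samples = [u'\u0041', u'\u007f', u'\u0080', u'\u1234', u'\U0001F600']

# Number of UTF-8 bytes needed for each sample.
lengths = [len(ch.encode('utf-8')) for ch in samples]

# For ASCII-range text, the UTF-8 and ASCII encodings are byte-for-byte identical.
same_as_ascii = u'A'.encode('utf-8') == u'A'.encode('ascii')

print(lengths)        # [1, 1, 2, 3, 4]
print(same_as_ascii)  # True
```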

You then echoed that value in the interactive interpreter, which uses the repr() function to build a debugging representation. For strings, that representation is a valid Python expression, one that is ASCII safe. Any byte in that value that is not a printable ASCII character is shown as an escape sequence. Thus the non-ASCII UTF-8 bytes are shown as \xhh hex escapes.

Most importantly, because A is a printable ASCII character, it is shown as is; any code editor or terminal will accept ASCII, and for most English text showing the actual text is so much more useful. The 'A' you see is still a byte string: it holds the single byte 0x41.
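One way to convince yourself that both results are byte strings is to look at the raw byte values rather than at the repr() output. This is only a sketch; going through bytearray is a choice made here so that the same lines behave identically on Python 2.7 and Python 3.

```python
a = u'\u0041'
b = u'\u1234'

# bytearray yields integer byte values on both Python 2 and Python 3.
a_bytes = list(bytearray(a.encode('utf-8')))
b_bytes = list(bytearray(b.encode('utf-8')))

print(a_bytes)                    # [65], i.e. 0x41, which repr() shows as 'A'
print([hex(n) for n in b_bytes])  # ['0xe1', '0x88', '0xb4']

# repr() escapes only the bytes that are not printable ASCII.
a_repr = repr(a.encode('utf-8'))
b_repr = repr(b.encode('utf-8'))
```

Both values are sequences of bytes; only their debugging representations differ, because 0x41 happens to be printable ASCII.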

Note that you used print for the unicode values stored in a and b, which means Python encodes those values with your terminal's codec, coordinating with your terminal to produce the right output. You did not echo the values in the interpreter. Had you done so, you'd also have seen debug output:

>>> a = u'\u0041'
>>> b = u'\u1234'
>>> a
u'A'
>>> b
u'\u1234'

In Python 3, the functionality of the repr() function (or rather, the object.__repr__ hook) has been updated to produce a unicode string with most printable codepoints left un-escaped. Use the new ascii() function to get the above behaviour.
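A short sketch of that difference, assuming a Python 3 interpreter (the variable names simply mirror the session above). Note that the ascii() result carries no u prefix, since Python 3 string literals are unicode by default.

```python
# Python 3 only: ascii() is not a builtin in Python 2.
a = '\u0041'
b = '\u1234'

default_repr = repr(b)   # printable non-ASCII code points are left as-is
escaped_repr = ascii(b)  # non-ASCII code points are escaped, as in Python 2

print(escaped_repr)                  # '\u1234'
print(default_repr == escaped_repr)  # False
print(ascii(a))                      # 'A'
```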

Upvotes: 3
