tinutomson

Reputation: 329

Understanding unicode and encoding in Python

When I enter the following in the Python 2.7 console:

>>> 'áíóús'
'\xc3\xa1\xc3\xad\xc3\xb3\xc3\xbas'
>>> u'áíóús'
u'\xe1\xed\xf3\xfas'

I get the above output. What is the difference between the two? I understand the basics of Unicode and the different kinds of encodings, such as UTF-8 and UTF-16, but I don't understand what is being printed on the console or how to make sense of it.

Upvotes: 0

Views: 263

Answers (1)

wim

Reputation: 363456

u'áíóús' is a string of text (a unicode object). What you see echoed in the REPL is the repr of that object, i.e. its canonical representation:

>>> print u'áíóús'
áíóús
>>> print repr(u'áíóús')
u'\xe1\xed\xf3\xfas'

Escapes like \xe1 are the hexadecimal ordinals (code points) of the individual characters:

>>> [hex(ord(c)) for c in u'áíóús']
['0xe1', '0xed', '0xf3', '0xfa', '0x73']

Only the last character is in the ASCII range, i.e. has an ordinal in range(128), so "s" is the only character that the Python 2.x repr shows unescaped:

>>> chr(0x73)
's'
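To make the link between the escapes and the ordinals concrete, here is a minimal sketch that rebuilds the escaped form by hand. The helper name escape_non_ascii is hypothetical, and it only approximates the real repr: it handles code points up to 0xff, which is all this string needs. The text is written with escapes so the snippet should run unchanged on Python 2.7 and 3.x.

```python
# The same text as u'áíóús', written with escapes so the source stays ASCII.
text = u'\xe1\xed\xf3\xfas'

def escape_non_ascii(s):
    """Hypothetical helper: approximate the Python 2 repr for this string.

    Assumes every non-ASCII character has an ordinal of at most 0xff.
    """
    parts = []
    for ch in s:
        if ord(ch) < 128:
            # ASCII characters are shown as they are.
            parts.append(ch)
        else:
            # Everything else becomes a backslash-x escape of its ordinal.
            parts.append(u'\\x%02x' % ord(ch))
    return u''.join(parts)

escaped = escape_non_ascii(text)
print(escaped)
```

The printed result matches what sits between the quotes in the echoed u'\xe1\xed\xf3\xfas'.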

'áíóús' is a string of bytes. What you see echoed is the repr of those bytes, which are an encoding of the same text characters. The encoding was chosen by your terminal emulator, which is UTF-8 in this case:

>>> 'áíóús'
'\xc3\xa1\xc3\xad\xc3\xb3\xc3\xbas'
>>> u'áíóús'.encode('utf-8')
'\xc3\xa1\xc3\xad\xc3\xb3\xc3\xbas'
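As a rough sketch of the text/bytes relationship, the snippet below encodes the text, compares the result with the bytes echoed above, and decodes it back. The Latin-1 comparison is an added illustration, not something from the question; it is there only to show that the bytes depend on the encoding. The snippet is written to run on both Python 2.7 and 3.x.

```python
# The same text as u'áíóús', written with escapes so the source stays ASCII.
text = u'\xe1\xed\xf3\xfas'

# Encoding turns text into bytes; UTF-8 uses two bytes for each accented letter.
utf8_bytes = text.encode('utf-8')
print(utf8_bytes == b'\xc3\xa1\xc3\xad\xc3\xb3\xc3\xbas')

# Five characters, but nine bytes in UTF-8.
print('%d characters, %d bytes' % (len(text), len(utf8_bytes)))

# A different encoding gives different bytes for the same text.
latin1_bytes = text.encode('latin-1')
print(latin1_bytes == b'\xe1\xed\xf3\xfas')

# Decoding with the matching encoding recovers the original text.
round_trip = utf8_bytes.decode('utf-8')
print(round_trip == text)
```

Decoding with the wrong encoding would either raise an error or produce different characters, which is why it matters which encoding the terminal assumed.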

Upvotes: 4
