Reputation: 329
When I enter the following in the Python 2.7 console:
>>>'áíóús'
'\xc3\xa1\xc3\xad\xc3\xb3\xc3\xbas'
>>>u'áíóús'
u'\xe1\xed\xf3\xfas'
I get the output above. What is the difference between the two? I understand the basics of Unicode and the different kinds of encodings such as UTF-8 and UTF-16, but I don't understand what is being printed on the console or how to make sense of it.
Upvotes: 0
Views: 263
Reputation: 363456
u'áíóús' is a string of text (a unicode object). What the REPL echoes is the canonical representation (the repr) of that object, not the text itself:
>>> print u'áíóús'
áíóús
>>> print repr(u'áíóús')
u'\xe1\xed\xf3\xfas'
Escapes like \xe1 are the hexadecimal ordinals (code points) of each character:
>>> [hex(ord(c)) for c in u'áíóús']
['0xe1', '0xed', '0xf3', '0xfa', '0x73']
Only the last character is in the ASCII range, i.e. has an ordinal in range(128), so only that last character, "s", appears unescaped in the Python 2.x repr:
>>> chr(0x73)
's'
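As a rough sketch of the same idea, the escapes from the repr can be typed back in to build the identical string. The variable names here are only illustrative, and the snippet is written with escapes rather than literal accented characters so that it should run the same way on Python 2.7 and Python 3:

```python
# Build the text from its escape sequences; each \xNN names one code point.
text = u'\xe1\xed\xf3\xfas'

# One ordinal per character: five characters in total.
ordinals = [ord(c) for c in text]
print(ordinals)  # [225, 237, 243, 250, 115]

# Only ordinals below 128 are ASCII, so only "s" survives unescaped
# in the Python 2.x repr.
ascii_chars = [c for c in text if ord(c) < 128]
print(len(ascii_chars))  # 1
```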
'áíóús' is a string of bytes. What you see echoed is an encoding of the same text characters: the bytes your terminal emulator sent for them, which here match UTF-8:
>>> 'áíóús'
'\xc3\xa1\xc3\xad\xc3\xb3\xc3\xbas'
>>> u'áíóús'.encode('utf-8')
'\xc3\xa1\xc3\xad\xc3\xb3\xc3\xbas'
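To make the text/bytes distinction concrete, here is a small sketch that compares lengths and round-trips through UTF-8. It assumes a UTF-8 terminal, as the output above suggests; the Latin-1 comparison is added purely for illustration and is not part of the original question:

```python
text = u'\xe1\xed\xf3\xfas'    # 5 characters of text
data = text.encode('utf-8')    # the bytes a UTF-8 terminal would send

print(len(text))  # 5
print(len(data))  # 9: each accented letter takes two bytes in UTF-8

# Decoding with the matching codec recovers the original text.
assert data.decode('utf-8') == text

# A different encoding gives different bytes for the same text.
latin1 = text.encode('latin-1')
print(len(latin1))  # 5: one byte per character in Latin-1
```

The Latin-1 bytes happen to equal the code points themselves, which is why the escapes in the unicode repr look like single bytes even though they are not.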
Upvotes: 4