Reputation: 38899
I have everything working as I want it in my code, but I'm still curious. I have a string: "stación". When I convert that string to unicode, I get:
>>> unicode('stación', 'utf-8')
u'staci\xf3n'
That "\xf3" in there looks like a byte character. This is different from:
>>> unicode('Поиск', 'utf-8')
u'\u041f\u043e\u0438\u0441\u043a'
In the latter example, as with everything I've converted to unicode before, I get unicode characters starting with "\u". Normally, when I see a byte starting with "\x", I think there's a problem. What gives here? Is this because "ó" is extended ASCII?
Upvotes: 0
Views: 397
Reputation: 414079
u'\xf3' is not a byte; it is a Unicode string containing a single Unicode code point (U+00F3 LATIN SMALL LETTER O WITH ACUTE).
What you see (u'\xf3') is how Python 2 chooses to represent Unicode characters whose ordinals (integers) are in the range 0..255 and that are not printable ASCII characters (Python 3 would show 'ó' here; only non-printable characters use the '\xhh' form there by default). As @Ignacio Vazquez-Abrams said, the u'\u00f3' and u'\U000000f3' literals create exactly the same Unicode string.
For comparison, here is what the Unicode character (u'\xf3') looks like as bytes in different character encodings:
>>> print(u'\xf3')
ó
>>> u'\xf3'.encode('utf-8')
b'\xc3\xb3'
>>> u'\xf3'.encode('utf-16be')
b'\x00\xf3'
>>> u'\xf3'.encode('utf-32le')
b'\xf3\x00\x00\x00'
>>> u'\xf3'.encode('cp1252')
b'\xf3'
Note: b'\xf3' and u'\xf3' are different things. The former is a bytestring that contains a single byte (the integer 243); the latter is a Unicode string that contains a single Unicode code point (Unicode ordinal 243). The number is the same, 243, but the units are different: 100 calories is not the same thing as 100 grams.
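To make the "same number, different units" point concrete, here is a minimal sketch. It assumes Python 3 (where indexing a bytes object yields an int); the variable names are illustrative only and do not come from the question.

```python
# Python 3 sketch; names are hypothetical, chosen for illustration.
byte_string = b'\xf3'   # one byte with value 243
text = u'\xf3'          # one code point, U+00F3

byte_value = byte_string[0]   # indexing bytes gives an int in Python 3
code_point = ord(text)        # ordinal of the single character

# Same number, different types.
print(byte_value, code_point)
print(type(byte_string).__name__, type(text).__name__)

# Crossing between the two always requires naming an encoding.
decoded = byte_string.decode('cp1252')   # bytes -> text
encoded = text.encode('utf-8')           # text -> bytes (two bytes in UTF-8)
print(decoded == text, len(encoded))
```

The byte 0xf3 only becomes "ó" once an encoding such as cp1252 is chosen; under UTF-8 the same character is two bytes, so the number alone does not tell you which thing you are holding.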
Upvotes: 0
Reputation: 798486
No, it's because "ó" is a non-ASCII character within the first 256 code points (U+0000 to U+00FF). Since its code point fits in two hex digits, the repr uses the shorter \xhh form and saves 2 characters compared with \uhhhh. The other two representations are valid, but not required.
>>> u'\u00f3'
u'\xf3'
>>> u'\U000000f3'
u'\xf3'
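The same rule can be checked from Python 3, as a rough sketch: the built-in ascii() escapes non-ASCII characters in a way similar to Python 2's repr(), so it shows which escape form is picked for each code point size. The variable names below are illustrative assumptions.

```python
# Python 3 sketch: ascii() escapes non-ASCII characters much as Python 2's repr() did.
short_form = u'\xf3'
four_digit = u'\u00f3'
eight_digit = u'\U000000f3'

# All three literals build the same one-character string.
same = short_form == four_digit == eight_digit
print(same)

# The escape form depends only on the size of the code point.
escaped_latin = ascii(u'\xf3')          # code point <= 0xFF     -> \xhh
escaped_cyrillic = ascii(u'\u041f')     # code point <= 0xFFFF   -> \uhhhh
escaped_astral = ascii(u'\U0001f600')   # code point above 0xFFFF -> \Uhhhhhhhh
print(escaped_latin, escaped_cyrillic, escaped_astral)
```

This is why "ó" shows up as \xf3 while the Cyrillic letters in "Поиск" show up as \u041f and so on: both are Unicode code points, and only the width of the escape differs.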
Upvotes: 2