Eli

Reputation: 38899

Why does this still look like bytes after I convert it to Unicode?

I have everything working as I want it in my code, but I'm still curious. I have a string: "stación". When I convert that string to Unicode, I get:

unicode('stación', 'utf-8')
>>> u'staci\xf3n'

That "\xf3" in there looks like a byte character. This is different from:

unicode('Поиск', 'utf-8')
>>> u'\u041f\u043e\u0438\u0441\u043a'

In the latter example, as with everything I've converted to Unicode before, I get Unicode characters starting with "\u". Normally, when I see a byte starting with "\x", I think there's a problem. What gives here? Is this because "ó" is extended ASCII?

Upvotes: 0

Views: 397

Answers (2)

jfs

Reputation: 414079

u'\xf3' is not a byte; it is a Unicode string with a single Unicode codepoint (U+00f3 LATIN SMALL LETTER O WITH ACUTE).

What you see (u'\xf3') is how Python 2 chooses to represent Unicode characters whose ordinals (integers) are in the range 0..255 and that are not printable ASCII characters. (Python 3 would show 'ó' here; by default, only non-printable characters use the '\xhh' form there.) As @Ignacio Vazquez-Abrams said, the u'\u00f3' and u'\U000000f3' literals create exactly the same Unicode string.
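To make the point above concrete, here is a minimal Python 3 sketch (an illustration, not part of the original session) showing that the three literal spellings produce the same one-character string, and that the built-in ascii() picks the shortest escape that fits the code point. The variable names are made up for this example.

```python
# Python 3 sketch: three escape spellings, one and the same code point.
s_x = '\xf3'          # 2 hex digits
s_u = '\u00f3'        # 4 hex digits
s_U = '\U000000f3'    # 8 hex digits

same = (s_x == s_u == s_U)   # all three literals are equal strings
length = len(s_x)            # one code point, not one byte
code_point = ord(s_x)        # 243, i.e. 0xF3

# ascii() escapes every non-ASCII character, using the shortest
# escape form that can hold the code point.
escaped_latin = ascii(s_x)          # code point <= 0xFF   -> \xhh form
escaped_cyrillic = ascii('Поиск')   # code points > 0xFF   -> \uhhhh form

print(same, length, code_point)
print(escaped_latin)
print(escaped_cyrillic)
```

The choice between '\x' and '\u' in the output depends only on how large the code point is, not on whether the value is a byte.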

For comparison, you can see how the Unicode character (u'\xf3') is represented as bytes in different character encodings (the session below is Python 3, where bytes are displayed with a b prefix):

>>> print(u'\xf3')
ó
>>> u'\xf3'.encode('utf-8')
b'\xc3\xb3'
>>> u'\xf3'.encode('utf-16be')
b'\x00\xf3'
>>> u'\xf3'.encode('utf-32le')
b'\xf3\x00\x00\x00'
>>> u'\xf3'.encode('cp1252')
b'\xf3'

Note: b'\xf3' and u'\xf3' are different things. The former is a bytestring that contains a single byte (the integer 243); the latter is a Unicode string that contains a single Unicode codepoint (Unicode ordinal 243). The number is the same, 243, but the units are different: 100 calories is not the same thing as 100 grams.
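The distinction in the note can be checked directly. The following is a small Python 3 sketch under that assumption (in Python 3, str is the Unicode type and bytes is the bytestring type); the variable names are illustrative only.

```python
# Python 3 sketch: same number, different units.
text = '\xf3'    # str: one Unicode code point, U+00F3
raw = b'\xf3'    # bytes: one byte with the value 243

# The underlying integer is the same in both cases.
same_number = (ord(text) == raw[0] == 243)

# A str and a bytes object never compare equal in Python 3.
equal = (text == raw)

# The lone byte 0xF3 only becomes the character once an encoding is chosen.
round_trip_latin1 = (raw.decode('latin-1') == text)

# Encoding the code point as UTF-8 yields two bytes, not the byte 0xF3.
utf8 = text.encode('utf-8')

# Conversely, the single byte 0xF3 is not valid UTF-8 on its own.
try:
    raw.decode('utf-8')
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

print(same_number, equal, round_trip_latin1, utf8, decodes_as_utf8)
```

Whether a given byte maps back to 'ó' depends entirely on the encoding used to decode it, which is why the two values should not be treated as interchangeable.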

Upvotes: 0

Ignacio Vazquez-Abrams

Reputation: 798486

No, it's because "ó" is a non-ASCII character within the first 256 code points (U+0000 to U+00FF). Since its ordinal fits in a single byte, the repr uses the shorter '\xhh' escape, which saves 2 characters compared with '\u00f3'. The other two representations are valid, but not required.

>>> u'\u00f3'
u'\xf3'
>>> u'\U000000f3'
u'\xf3'

Upvotes: 2
