cafekaze

Reputation: 377

Concatenation of unicode and byte strings

From what I understand, when concatenating a byte string and a Unicode string, Python will automatically decode the byte string based on the default encoding and convert it to Unicode before concatenating.

I'm assuming something like this happens if the default is 'ascii' (please correct me if I'm mistaken):

byte string -> bytes interpreted as ASCII -> Unicode codepoints -> Unicode string

Wouldn't it be easier and raise fewer UnicodeDecodeErrors if, for example, u'a' + 'Ӹ' were converted to u'a' + u'Ӹ' directly before concatenating? Why does the string need to be decoded first? Why does it matter whether the string contains non-ASCII characters if it will be converted to Unicode anyway?
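For example, here is a minimal sketch of the behavior I'm describing (Python 2, with the source file saved as UTF-8):

    # -*- coding: utf-8 -*-
    import sys
    print(sys.getdefaultencoding())  # 'ascii' on Python 2

    print(u'a' + 'bc')   # u'abc': 'bc' is implicitly decoded as ASCII
    print(u'a' + 'Ӹ')    # UnicodeDecodeError: 'Ӹ' is the bytes '\xd3\xb8' here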

Upvotes: 0

Views: 2056

Answers (1)

Remy Lebeau

Reputation: 595837

Wouldn't it be easier and raise fewer UnicodeDecodeErrors if, for example, u'a' + 'Ӹ' were converted to u'a' + u'Ӹ' directly before concatenating?

It could probably do that with literals, but not with string contents at runtime. Imagine a byte string that contains a 'Ӹ' character. How do you think it can be converted to u'Ӹ' in Unicode? IT HAS TO BE DECODED!

Ӹ is Unicode codepoint U+04F8 CYRILLIC CAPITAL LETTER YERU WITH DIAERESIS. 'Ӹ' and u'Ӹ' are not encoded the same way (in fact, I can't even find an 8-bit encoding that supports U+04F8), so you can't simply change one into the other directly. A byte string has to be decoded from its source encoding (ASCII, ISO-8859-1, etc.) to an intermediary (ISO 10646, Unicode) that can then be represented in the target encoding (UTF-8, UTF-16, UTF-32, etc.).
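A quick Python 2 sketch of that point, assuming the bytes happen to be UTF-8: converting bytes to unicode is always a decode, and the result depends entirely on which source encoding you claim the bytes are in:

    # Python 2: bytes -> unicode is always a decode, and the result
    # depends on the declared source encoding.
    raw = '\xd3\xb8'                       # the UTF-8 bytes for Ӹ
    print(repr(raw.decode('utf-8')))       # u'\u04f8' -> Ӹ
    print(repr(raw.decode('iso-8859-1')))  # u'\xd3\xb8' -> two wrong characters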

Why does the string need to be decoded first?

Because the two values being concatenated need to be in the same encoding before they can be concatenated.
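In Python 2 terms, that means decoding the byte string yourself, using its real encoding, so both operands are already unicode before the + (a sketch, assuming UTF-8 input):

    # Python 2: decode explicitly instead of relying on the
    # implicit ASCII decode.
    data = '\xd3\xb8'                      # bytes known to be UTF-8
    result = u'a' + data.decode('utf-8')   # both operands are now unicode
    print(repr(result))                    # u'a\u04f8'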

Why does it matter if the string contains non-ASCII characters if it will be converted to Unicode anyway?

Because non-ASCII characters are represented differently in different encodings. Unicode is universal, but other encodings are not. And Python supports hundreds of encodings.

Take the Euro sign (€, Unicode codepoint U+20AC), for example. It does not exist in ASCII or in most of the ISO-8859-X encodings, but it is encoded as byte 0xA4 in ISO-8859-7, -15, and -16, and as byte 0x88 in Windows-1251. Meanwhile, byte 0xA4 represents different Unicode codepoints in other encodings: it is ¤ (U+00A4 CURRENCY SIGN) in ISO-8859-1, but Є (U+0404 CYRILLIC CAPITAL LETTER UKRAINIAN IE) in ISO-8859-5, etc.

So how do you expect Python to convert 0xA4 to Unicode? Should it convert it to U+00A4, U+0404, or U+20AC?
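You can see that ambiguity directly in Python 2, where the same byte decodes to three different codepoints depending on which codec you name:

    # Python 2: byte 0xA4 maps to a different codepoint in each encoding.
    b = '\xa4'
    print(repr(b.decode('iso-8859-1')))   # u'\xa4'   -> ¤ CURRENCY SIGN
    print(repr(b.decode('iso-8859-5')))   # u'\u0404' -> Є
    print(repr(b.decode('iso-8859-15')))  # u'\u20ac' -> €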

So, string encoding matters!

See The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Upvotes: 3
