herpderp

Reputation: 16177

decode('utf-16') does not decode correctly

I have the UTF-16 string \u0423\u043a\u0440\u0430\u0438\u043d\u0430. This encodes Украина; you can verify it with any online UTF-16 decoder.

But trying to decode it in Python:

print(b"\u0423\u043a\u0440\u0430\u0438\u043d\u0430".decode('utf16'))

outputs: 畜㐰㌲畜㐰愳畜㐰〴畜㐰〳畜㐰㠳畜㐰搳畜㐰〳

Why?

Upvotes: 1

Views: 706

Answers (2)

Karl Knechtel

Reputation: 61527

If you have "\u0423\u043a\u0440\u0430\u0438\u043d\u0430", then that doesn't "encode" Украина; it is equal to Украина. It is not "a utf-16 string"; it is a string. There is no such thing as "a {name of encoding} string".
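A quick check of that claim, as a minimal sketch: in a str literal, each \u escape is resolved to the code point it names when the source is parsed, so both spellings below produce the same string value. The variable names are illustrative only.

```python
# Two spellings of the same str value: escapes vs. the characters themselves.
escaped = "\u0423\u043a\u0440\u0430\u0438\u043d\u0430"
plain = "Украина"

print(escaped == plain)                   # True
print(len(escaped))                       # 7
print([hex(ord(ch)) for ch in escaped])   # the seven code points
```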

There are "bytes which encode a string using {name of encoding}". bytes objects are not text. Strings are not byte sequences. They are not meaningfully related types, and only appear to be for legacy reasons. (Relatively unfortunate ones, honestly; the first Unicode standard was closer in time to the first ASCII standard than to today.) In that light, the literal syntax for bytes objects, as well as their canonical repr, is rather unfortunate; but that's what we have to live with.

If you have b"\u0423\u043a\u0440\u0430\u0438\u043d\u0430", then you have a bytes object, created using a literal. The literal syntax for bytes does not assign any particular meaning to \u escapes. The u in \u stands for Unicode. In a string, that sequence represents a Unicode code point. bytes objects cannot contain those (they contain bytes), so there is no reason for them to support the escape sequence. As usual for things that Python encloses in quotation marks, a backslash followed by something with no special meaning is just a backslash (even though you normally should double up backslashes to escape them). Of course, within a bytes object, a backslash symbol doesn't represent the backslash text character, because bytes objects don't store text. Instead, for those historical reasons, it represents the integral value 92.
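To see where the garbage in the question comes from, here is a small sketch. It assumes little-endian UTF-16 (the question's 'utf16' codec falls back to the machine's native byte order when there is no BOM, which is little-endian on most desktops), and it writes the backslashes doubled so that no invalid-escape warning is raised; the value is the same as the question's literal.

```python
# b"\\u0423..." is the same bytes value as the question's b"\u0423...",
# just with each backslash written explicitly.
data = b"\\u0423\\u043a\\u0440\\u0430\\u0438\\u043d\\u0430"

print(len(data))       # 42: seven groups of six ASCII bytes
print(list(data[:6]))  # [92, 117, 48, 52, 50, 51] -> \ u 0 4 2 3

# Decoding as UTF-16 pairs up unrelated ASCII bytes into 16-bit units,
# e.g. bytes 0x5C 0x75 become the single code unit 0x755C.
garbled = data.decode("utf-16-le")
print(garbled)
print([hex(ord(ch)) for ch in garbled[:3]])  # ['0x755c', '0x3430', '0x3332']
```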

If you want to create a bytes object that contains bytes which represent a string in UTF-16 encoding, then:

  • Determine whether you actually mean UTF-16-LE or UTF-16-BE. It is necessary to choose.
  • For each Unicode code point of the string, if it's in the Basic Multilingual Plane, find the corresponding 16-bit value; for other characters, use a surrogate pair of two such values.
  • For each 16-bit value, represent it as two bytes, with the order determined by the desired endianness (-LE or -BE).
  • For each of those bytes, represent it in the bytes literal syntax with a \x escape sequence (backslash, lowercase x, and two hexadecimal digits), or use the corresponding ASCII character, if applicable.
  • OR, create a list (or other iterable) of the byte values, and pass it to the bytes constructor.
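The steps above can be sketched roughly as follows. The helper name encode_utf16_by_hand and its byteorder parameter are made up for illustration; in practice str.encode("utf-16-le") or str.encode("utf-16-be") does the same work in one call.

```python
def encode_utf16_by_hand(text, byteorder="little"):
    """Hypothetical helper: encode text as UTF-16 (no BOM), one step at a time."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # Basic Multilingual Plane: the code point is the 16-bit value.
            units = [cp]
        else:
            # Outside the BMP: split into a high and a low surrogate.
            offset = cp - 0x10000
            units = [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]
        for unit in units:
            # Two bytes per 16-bit value, ordered by the chosen endianness.
            out.extend(unit.to_bytes(2, byteorder))
    return bytes(out)  # the bytes constructor accepts an iterable of ints

by_hand = encode_utf16_by_hand("Украина")
# The same bytes written as a literal with \x escapes (UTF-16-LE).
literal = b"\x23\x04\x3a\x04\x40\x04\x30\x04\x38\x04\x3d\x04\x30\x04"

print(by_hand == literal == "Украина".encode("utf-16-le"))  # True
print(by_hand.decode("utf-16-le"))  # Украина
```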

If you want to read bytes from a file that represent a UTF-16 string, assuming you know the endianness, then it's simple:

  • Open the file in binary mode and read whatever number of bytes.
  • To get the corresponding string, use the .decode method of the bytes object.
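A minimal sketch of those two steps, assuming the file holds UTF-16-LE data with no BOM. The function name read_utf16_text and the file name sample.bin are placeholders; the temporary file is written first only so that the example is self-contained.

```python
import os
import tempfile

def read_utf16_text(path, encoding="utf-16-le"):
    """Hypothetical helper: read raw bytes, then decode them to a str."""
    with open(path, "rb") as f:   # binary mode: we get bytes, not text
        raw = f.read()
    return raw.decode(encoding)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bin")
    # Create some UTF-16-LE input to read back.
    with open(path, "wb") as f:
        f.write("Украина".encode("utf-16-le"))
    text = read_utf16_text(path)

print(text)  # Украина
```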

If you have b"\u0423\u043a\u0440\u0430\u0438\u043d\u0430" and wish to treat it as "\u0423\u043a\u0440\u0430\u0438\u043d\u0430", and for some reason you cannot fix the process which gave you this bad input, then that is what the unicode-escape codec is for.
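For example, a short sketch of that last resort. The backslashes are doubled here so the literal is unambiguous; the bytes are the same 42 ASCII characters as in the question. Note that unicode-escape treats any non-ASCII bytes in the input as Latin-1, so it is only a good fit when the input really is pure ASCII escape text like this.

```python
# Bytes containing the literal characters \, u, 0, 4, 2, 3, and so on.
raw = b"\\u0423\\u043a\\u0440\\u0430\\u0438\\u043d\\u0430"

# unicode-escape interprets those ASCII escape sequences as code points.
text = raw.decode("unicode-escape")
print(text)  # Украина
```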

Upvotes: 0

SuperStormer

Reputation: 5387

That's not a "utf-16" string, it's just a regular unicode-escaped string. print("\u0423\u043a\u0440\u0430\u0438\u043d\u0430") prints out the correct output without needing to decode anything.

However, if you actually have a bytestring with the literal bytes "\", "u", "0", "4", etc. for some reason, use print(b"\u0423\u043a\u0440\u0430\u0438\u043d\u0430".decode("unicode-escape")).

Upvotes: 4
