Reputation: 16177
I have the utf-16 string \u0423\u043a\u0440\u0430\u0438\u043d\u0430. This encodes Украина; you can verify it with any online UTF-16 decoder.
But trying to decode it in Python:
print(b"\u0423\u043a\u0440\u0430\u0438\u043d\u0430".decode('utf16'))
outputs: 畜㐰㌲畜㐰愳畜㐰〴畜㐰〳畜㐰㠳畜㐰搳畜㐰〳
Why?
Upvotes: 1
Views: 706
Reputation: 61527
If you have "\u0423\u043a\u0440\u0430\u0438\u043d\u0430", then that doesn't "encode" Украина; it is equal to Украина. It is not "a utf-16 string"; it is a string. There is no such thing as "a {name of encoding} string".
There are "bytes which encode a string using {name of encoding}". bytes objects are not text. Strings are not byte sequences. They are not meaningfully related types, and only appear to be for legacy reasons. (Relatively unfortunate ones, honestly; the first Unicode standard was closer in time to the first ASCII standard than to today.) In that light, the literal syntax for bytes objects, as well as their canonical repr, is rather unfortunate; but that's what we have to live with.
If you have b"\u0423\u043a\u0440\u0430\u0438\u043d\u0430", then you have a bytes object, created using a literal. The literal syntax for bytes does not assign any particular meaning to \u escapes. The u in \u stands for Unicode. In a string, that sequence represents a Unicode code point. bytes objects cannot contain those - they contain bytes - so there is no reason for them to support the escape sequence. As usual for things that Python encloses in quotation marks, a backslash followed by something with no special meaning is just a backslash (even though you normally should double up backslashes to escape them). Of course, within a bytes object, a backslash symbol doesn't represent the backslash text character, because bytes objects don't store text. Instead, for those historical reasons, it represents the integer value 92.
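A minimal sketch of what that bytes literal actually holds, and why decoding it as UTF-16 yields unrelated characters. The variable names are illustrative, and little-endian order is spelled out explicitly here on the assumption that the question's machine is little-endian (without a byte order mark, the plain utf16 codec falls back to the platform's native order).

```python
# Same bytes as the literal b"\u0423..." from the question; the backslashes
# are doubled here only to avoid Python's invalid-escape warning.
raw = b"\\u0423\\u043a\\u0440\\u0430\\u0438\\u043d\\u0430"

print(len(raw))        # 42: seven escapes, six ASCII characters each
print(list(raw[:6]))   # [92, 117, 48, 52, 50, 51] for \, u, 0, 4, 2, 3

# UTF-16 consumes two bytes per code unit, so the pair "\" + "u" alone
# is read as a single, unrelated character.
first_unit = int.from_bytes(raw[:2], "little")
garbled = raw.decode("utf-16-le")
print(hex(first_unit))             # 0x755c
print(garbled[0], len(garbled))    # one character per byte pair: 21 in total
```

The 42 ASCII bytes are paired off into 21 code units, which matches the 21 characters of garbage in the question's output.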
If you want to create a bytes object that contains bytes which represent a string in UTF-16 encoding, then write each byte in the bytes literal syntax with a \x escape sequence (backslash, lowercase x, and two hexadecimal digits), or use the corresponding ASCII character, if applicable. Alternatively, use the bytes constructor.
If you want to read bytes from a file that represent a UTF-16-encoded string, assuming you know the endianness, then it's simple: read the bytes and use the .decode method of the bytes object.
If you have b"\u0423\u043a\u0440\u0430\u0438\u043d\u0430" and wish to treat it as "\u0423\u043a\u0440\u0430\u0438\u043d\u0430"; and for some reason you cannot fix the process which gave you this bad input, then that is what the unicode-escape codec is for.
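The cases above can be sketched in a few lines. This is only an illustration: the variable names are made up, the bytes are produced in memory rather than read from a file, and little-endian UTF-16 is an arbitrary choice.

```python
text = "\u0423\u043a\u0440\u0430\u0438\u043d\u0430"   # equal to "Украина"

# Real UTF-16 bytes: two bytes per character for this text.
utf16_le = text.encode("utf-16-le")
same_bytes = b"\x23\x04\x3a\x04"          # first two characters, written with \x escapes
print(utf16_le[:4] == same_bytes)         # True
print(utf16_le.decode("utf-16-le") == text)   # True: decoding reverses encoding

# Bytes that merely spell out the escapes in ASCII are a different thing;
# the unicode-escape codec interprets them the way a string literal would.
escaped = b"\\u0423\\u043a\\u0440\\u0430\\u0438\\u043d\\u0430"
recovered = escaped.decode("unicode-escape")
print(recovered == text)                  # True
```

Note that unicode-escape is a repair tool for input that was already mangled into escape text; bytes that are genuinely UTF-16 should go through .decode with the matching UTF-16 codec instead.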
Upvotes: 0
Reputation: 5387
That's not a "utf-16" string, it's just a regular unicode-escaped string. print("\u0423\u043a\u0440\u0430\u0438\u043d\u0430") prints out the correct output without needing to decode anything.
However, if you actually have a bytestring with the literal bytes "\", "u", "0", "4", etc. for some reason, use print(b"\u0423\u043a\u0440\u0430\u0438\u043d\u0430".decode("unicode-escape")).
Upvotes: 4