UTF-16 Encoding and Japanese in Python 3

Question

I am trying to encode Japanese characters to UTF-16; basically mimic what this online tool does: https://www.branah.com/unicode-converter

For example,

'インスタントグラム'

should become

'\u30a4\u30f3\u30b9\u30bf\u30f3\u30c8\u30b0\u30e9\u30e0'

I am using the following block of code:

jp_example = 'インスタントグラム'
jp_example.encode('utf-16')

and instead receive output that looks like this:

b'\xff\xfe\xa40\xf30\xb90\xbf0\xf30\xc80\xb00\xe90\xe00'

Any idea what I'm missing? I played around with other encodings and nothing has worked for me.

FWIW, I am using a Jupyter Notebook with Python 3.6.3rc1+.

Martijn Pieters · Accepted Answer

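Your expected output is not UTF-16. UTF-16 is an encoding that uses 2 bytes per codepoint in the Basic Multilingual Plane (and 4 bytes, as a surrogate pair, for codepoints beyond it); イ, Unicode codepoint U+30A4 KATAKANA LETTER I, is represented in UTF-16 as the bytes A4 30 or 30 A4 hexadecimal, depending on the byte order the encoder picked. That is exactly what your output contains: \xff\xfe is the byte order mark, and \xa40 is the byte pair A4 30, because Python displays the byte 0x30 as the ASCII character 0.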
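To make the byte order point concrete, here is a minimal sketch using only the standard codecs. The variable names are mine, and the output of the plain 'utf-16' codec depends on the machine's native byte order, so treat that line as illustrative rather than guaranteed:

```python
ch = 'イ'  # U+30A4 KATAKANA LETTER I

little = ch.encode('utf-16-le')   # explicit little-endian, no BOM
big = ch.encode('utf-16-be')      # explicit big-endian, no BOM
with_bom = ch.encode('utf-16')    # BOM first, then the platform's byte order

print(little.hex())    # a430
print(big.hex())       # 30a4
print(with_bom.hex())  # fffea430 on a little-endian machine
print(little)          # b'\xa40' (byte 0x30 is shown as the character 0)
```

If you do need real UTF-16 bytes, picking 'utf-16-le' or 'utf-16-be' explicitly avoids both the byte order mark and the platform dependence.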

Instead, your expected output consists of Unicode codepoints embedded in \u escapes. Such escapes are used in multiple contexts, including Python string literals and JSON data.

If you are producing JSON data, use json.dumps() to create a JSON string; any codepoints in that string outside of the ASCII character set are represented with \uhhhh escape sequences:

>>> jp_example = 'インスタントグラム'
>>> import json
>>> print(json.dumps(jp_example))
"\u30a4\u30f3\u30b9\u30bf\u30f3\u30c8\u30b0\u30e9\u30e0"

Otherwise, if you are generating Python string literals, use the unicode_escape codec. It also outputs a byte sequence, so for printing purposes I've decoded those bytes back to text using the ASCII codec:

>>> print(jp_example.encode('unicode_escape').decode('ascii'))
\u30a4\u30f3\u30b9\u30bf\u30f3\u30c8\u30b0\u30e9\u30e0
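Both notations are reversible, which is a quick way to check that you picked the one your consumer expects. A small sketch, assuming the escaped data contains only ASCII bytes (which is what both approaches above produce):

```python
import json

jp_example = 'インスタントグラム'

json_text = json.dumps(jp_example)                 # str, wrapped in double quotes
py_escaped = jp_example.encode('unicode_escape')   # bytes, no quotes

# JSON notation is read back with json.loads()
from_json = json.loads(json_text)

# Python escape notation is read back with the same codec
from_python = py_escaped.decode('unicode_escape')

print(from_json == jp_example, from_python == jp_example)  # True True
```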

You need to be absolutely certain what your data will be used for. JSON and Python string literal notation differ when it comes to codepoints outside of the Basic Multilingual Plane, such as most emoji:

>>> print(json.dumps('🐱👤'))
"\ud83d\udc31\ud83d\udc64"
>>> print('🐱👤'.encode('unicode_escape').decode('ascii'))
\U0001f431\U0001f464

JSON uses surrogate pairs to represent such codepoints, while Python uses a \Uhhhhhhhh 8-hex-digit escape sequence.
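The surrogate pairs in the JSON output are the UTF-16 code units of the codepoint. If what you really want is one \uhhhh escape per UTF-16 code unit with no surrounding quotes, a sketch along these lines would do it. The function name utf16_escapes is hypothetical (it is not part of any library), and note that, unlike json.dumps(), it escapes ASCII characters too:

```python
def utf16_escapes(text):
    # Hypothetical helper: encode to big-endian UTF-16 so each 2-byte
    # code unit reads in the same order as its hex escape.
    data = text.encode('utf-16-be')
    # Split the byte string into 2-byte code units.
    units = (data[i:i + 2] for i in range(0, len(data), 2))
    # Emit a backslash-u escape followed by the 4 hex digits of each unit.
    return ''.join('\\u' + unit.hex() for unit in units)


print(utf16_escapes('インスタントグラム'))
# \u30a4\u30f3\u30b9\u30bf\u30f3\u30c8\u30b0\u30e9\u30e0
print(utf16_escapes('🐱'))
# \ud83d\udc31
```

I'd still reach for json.dumps() or unicode_escape first, since those produce notation that a standard parser can read back.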

And just to be explicit: what that unicode-converter site produces is not helpful and outright misleading. The 'UTF-16' box produces JSON-notation escape sequences, or UTF-16 little-endian hex values when you check the Remove \u box, without a byte order mark. What the u+ markup for the UTF-32 output is supposed to do I don't quite understand, and the UTF-8 box outputs a UTF-8-to-Latin-1 Mojibake. I would not use that site.
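For reference, this is what a UTF-8-to-Latin-1 Mojibake looks like in Python. It is only a sketch of the general effect; I am assuming the site's UTF-8 box does the equivalent, and have not checked how it is implemented:

```python
text = 'イ'

# Encode correctly as UTF-8, then misread those bytes as Latin-1.
mojibake = text.encode('utf-8').decode('latin-1')
print(ascii(mojibake))  # '\xe3\x82\xa4' (three Latin-1 characters, not one katakana)

# The damage is reversible as long as no bytes were lost along the way.
repaired = mojibake.encode('latin-1').decode('utf-8')
print(repaired == text)  # True
```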
