Victor Mezrin

Reputation: 2847

Python 3, Unicode conversion, two \u0000 as one character

My Python 3 script receives strings from a C++ program via a pipe. The strings are encoded as Unicode code point escapes, and I need to decode them correctly.

For example, consider a string that contains Cyrillic characters: 'тест test'

Encoding this string in Python 3 with print('тест test'.encode()) gives b'\xd1\x82\xd0\xb5\xd1\x81\xd1\x82 test'

The C++ program encodes the same string as: b'\u00D1\u0082\u00D0\u00B5\u00D1\u0081\u00D1\u0082 test'

The encoded strings look very similar: Python 3 uses \x (2 hex digits) and the C++ program uses \u (4 hex digits). But I can't figure out how to convert b'\u00D1\u0082\u00D0\u00B5\u00D1\u0081\u00D1\u0082 test' to 'тест test'. The main problem is that Python 3 treats b'\u00D1\u0082\u00D0\u00B5\u00D1\u0081\u00D1\u0082' as an 8-character string, but it contains only 4 characters.

Upvotes: 1

Views: 8600

Answers (1)

Mark Tolonen

Reputation: 177901

If the string you receive from C++ is the following in Python:

s = b'\u00D1\u0082\u00D0\u00B5\u00D1\u0081\u00D1\u0082 test'

Then this will decode it:

result = s.decode('unicode-escape').encode('latin1').decode('utf8')
print(result)

Output:

тест test
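
For reuse, the same chain could be wrapped in a small helper. This is only a sketch: the name decode_escaped_utf8 is made up for illustration, and it assumes every escape in the input lies in the range \u0000 to \u00FF, as in the example. The input is written as a raw bytes literal (rb'...') so that Python does not warn about \u being an invalid escape in a bytes literal.

```python
def decode_escaped_utf8(raw: bytes) -> str:
    """Decode bytes in which each backslash-u escape stands for one UTF-8 byte."""
    # Step 1: interpret the \uXXXX escapes, giving one code point per escape.
    text = raw.decode('unicode-escape')
    # Step 2: map code points 0-255 back to single bytes.
    # Raises UnicodeEncodeError if any code point is above U+00FF.
    utf8_bytes = text.encode('latin1')
    # Step 3: decode the recovered bytes as UTF-8.
    return utf8_bytes.decode('utf8')


# Raw bytes literal: the backslashes are kept as-is, no escape processing.
s = rb'\u00D1\u0082\u00D0\u00B5\u00D1\u0081\u00D1\u0082 test'
print(decode_escaped_utf8(s))  # тест test
```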

The first stage converts the byte string received into a Unicode string:

>>> s1 = s.decode('unicode-escape')
>>> s1
'Ñ\x82еÑ\x81Ñ\x82 test'

Unfortunately, each codepoint in this string is really a UTF-8 byte value rather than the intended character. The latin1 encoding is a 1:1 mapping of the first 256 Unicode codepoints, so encoding with this codec converts the codepoints back to byte values in a byte string:

>>> s2 = s1.encode('latin1')
>>> s2
b'\xd1\x82\xd0\xb5\xd1\x81\xd1\x82 test'
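
The 1:1 property of latin1 can be checked directly. The snippet below is an illustrative check, not part of the original solution; the variable names are arbitrary.

```python
# Build a string holding every code point from U+0000 to U+00FF.
all_code_points = ''.join(chr(i) for i in range(256))

# Each code point encodes to the single byte with the same numeric value.
encoded = all_code_points.encode('latin1')
print(encoded == bytes(range(256)))                 # True

# Decoding is the exact inverse, so no information is lost either way.
print(encoded.decode('latin1') == all_code_points)  # True

# Code points above U+00FF have no latin1 byte, so encoding them fails.
try:
    'т'.encode('latin1')
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)                                    # False
```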

Now the byte string can be decoded to the correct Unicode string:

>>> s3 = s2.decode('utf8')
>>> s3
'тест test'
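
To test the decoding without the C++ program, the input format can be imitated in Python. This is a sketch under an assumption: how the C++ side really produces its output is not shown in the question, so the hypothetical escape_utf8_bytes below simply reproduces the observed format by writing each non-ASCII UTF-8 byte as \u00XX.

```python
def escape_utf8_bytes(text: str) -> bytes:
    """Hypothetical stand-in for the C++ side (an assumption, not its real code)."""
    parts = []
    for byte in text.encode('utf8'):
        if byte < 0x80:
            # ASCII bytes pass through unchanged. A literal backslash in the
            # text is not handled here and would confuse the decoder.
            parts.append(chr(byte))
        else:
            # Each non-ASCII UTF-8 byte becomes a 4-hex-digit \u escape.
            parts.append('\\u%04X' % byte)
    return ''.join(parts).encode('ascii')


escaped = escape_utf8_bytes('тест test')
print(escaped == rb'\u00D1\u0082\u00D0\u00B5\u00D1\u0081\u00D1\u0082 test')  # True

# The three decoding steps invert the escaping.
restored = escaped.decode('unicode-escape').encode('latin1').decode('utf8')
print(restored == 'тест test')  # True
```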

Upvotes: 4
