Reputation: 2847
My Python 3 script receives strings from a C++ program via a pipe. The strings are encoded as Unicode code point escapes, and I need to decode them correctly.
For example, consider a string that contains Cyrillic symbols: 'тест test'
Encoding this string in Python 3 with print('тест test'.encode()) gives b'\xd1\x82\xd0\xb5\xd1\x81\xd1\x82 test'
The C++ program encodes this string as: b'\u00D1\u0082\u00D0\u00B5\u00D1\u0081\u00D1\u0082 test'
The encoded strings look very similar: Python 3 uses \x (2 hex digits) and the C++ program uses \u (4 hex digits).
But I can't figure out how to convert b'\u00D1\u0082\u00D0\u00B5\u00D1\u0081\u00D1\u0082 test' to 'тест test'.
The main problem: Python 3 treats b'\u00D1\u0082\u00D0\u00B5\u00D1\u0081\u00D1\u0082' as an 8-character string, but it contains only 4 characters.
Upvotes: 1
Views: 8600
Reputation: 177901
If the string you receive from C++ is the following in Python:
s = b'\u00D1\u0082\u00D0\u00B5\u00D1\u0081\u00D1\u0082 test'
Then this will decode it:
result = s.decode('unicode-escape').encode('latin1').decode('utf8')
print(result)
Output:
тест test
The first stage converts the byte string received into a Unicode string:
>>> s1 = s.decode('unicode-escape')
>>> s1
'Ñ\x82ÐµÑ\x81Ñ\x82 test'
Unfortunately, the Unicode codepoints are really UTF-8 byte values. The latin1 encoding is a 1:1 mapping of the first 256 Unicode codepoints, so encoding with this codec converts the codepoints back to byte values in a byte string:
>>> s2 = s1.encode('latin1')
>>> s2
b'\xd1\x82\xd0\xb5\xd1\x81\xd1\x82 test'
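The 1:1 property of latin1 can be checked directly. This is a minimal sketch; the variable names are chosen here for illustration only:

```python
# Build a string containing every code point from U+0000 to U+00FF.
all_codepoints = ''.join(chr(i) for i in range(256))

# latin1 encodes each of them to the single byte with the same value...
latin1_is_identity = all_codepoints.encode('latin1') == bytes(range(256))

# ...and decoding reverses the mapping exactly.
latin1_round_trips = bytes(range(256)).decode('latin1') == all_codepoints

# Code points above U+00FF (such as Cyrillic letters) cannot be encoded.
try:
    'т'.encode('latin1')
    cyrillic_rejected = False
except UnicodeEncodeError:
    cyrillic_rejected = True

print(latin1_is_identity, latin1_round_trips, cyrillic_rejected)
```

The last check matters in practice: if the incoming data ever contains an escape above \u00FF, the latin1 step raises UnicodeEncodeError instead of silently producing wrong output.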
Now the byte string can be decoded to the correct Unicode string:
>>> s3 = s2.decode('utf8')
>>> s3
'тест test'
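The three stages can be wrapped in one helper. This is a sketch under the assumption that every \u escape in the input holds a single UTF-8 byte value (0x00 to 0xFF), as in the example above; the function name decode_escaped_utf8 is made up for illustration. The sample input is written as a raw bytes literal (rb'...') so that the backslashes are kept literally without Python warning about an invalid escape sequence.

```python
def decode_escaped_utf8(raw: bytes) -> str:
    """Decode bytes whose u-escapes each hold one UTF-8 byte value."""
    # Stage 1: interpret the escapes, giving one code point per UTF-8 byte.
    as_codepoints = raw.decode('unicode-escape')
    # Stage 2: latin1 turns code points 0-255 back into identical byte values.
    # Assumption: no escape exceeds 0xFF; otherwise this raises UnicodeEncodeError.
    as_bytes = as_codepoints.encode('latin1')
    # Stage 3: the recovered bytes are ordinary UTF-8.
    return as_bytes.decode('utf8')


s = rb'\u00D1\u0082\u00D0\u00B5\u00D1\u0081\u00D1\u0082 test'
decoded = decode_escaped_utf8(s)
print(decoded)
print(len(decoded))
```

Note that the unicode-escape codec also interprets other backslash sequences (such as \n or \\) in the input, so data containing literal backslashes may need separate handling.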
Upvotes: 4