Reputation: 105
I have an array of bytes representing a UTF-8 encoded string. I want to decode these bytes back into the string in Python 2. I am relying on Python 2 for my overall program, so I cannot switch to Python 3.
array = [67, 97, 102, -61, -87, 32, 70, 108, 111, 114, 97]
-> Café Flora
Since every character in the string I want is not necessarily represented by exactly 1 byte in the array, I can not use a solution like:
"".join(map(chr, array))
I tried to create a function that would step through the array, and whenever it encounters a number not in the range 0-127 (ASCII), create a new 16 bit int, shift the current bits over 8 to the left, and then add the following byte using a bitwise OR. Finally it would use unichr() to decode it.
def decode_bytes(byte_array):
    result = []
    for i in range(len(byte_array)):
        x = byte_array[i]
        if x < 0:
            b16 = x & 0xFFFF  # 16 bit
            b16 = b16 << 8
            b16 = b16 | byte_array[i+1]
            result.append(unichr(b16))
        else:
            result.append(chr(x))
    return "".join(result)
However, this was unsuccessful.
The following article explains the issue very well, and includes a nodeJS solution:
http://ixti.net/development/node.js/2011/10/26/get-utf-8-string-from-array-of-bytes-in-node-js.html
Upvotes: 5
Views: 8649
Reputation: 110311
Keep in mind that a "string" in Python 2 is not proper text, just a sequence of bytes in memory that happens to map to characters when you "print" it. If the encoding of the intended characters in the byte sequence matches the one your terminal uses, you will see properly formatted text.
If your terminal is not UTF-8, just printing the byte string would show you the wrong results, even if the bytes in memory are correct. That is why the extra "decode" step is needed at the end of the expression.
text = b''.join(chr(i if i >= 0 else 256 + i) for i in array).decode('utf-8')
As your source encoded the numbers between 128 and 255 as negative numbers, we have the inline "if" operator to renormalize the value before calling "chr".
Just to be clear: you say "Since every character in the string I want is not necessarily represented by exactly 1 byte in the array". With Python 2.x byte strings, it is the terminal that takes care of that when you print. If you want to deal with proper text, join your numbers into a (byte) string and then call its "decode" method. That is the part that knows about UTF-8 multi-byte encoded characters and gives you back a text string object (a 'unicode' object in Python 2) that treats each character as a single entity.
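As a minimal sketch of the same idea, the sign fix can also be done with a bit mask instead of the inline "if", and a bytearray can hold the result. The function name decode_signed_utf8 is only an illustrative choice, and the input is assumed to hold values in the range -128 to 127. This form should behave the same on Python 2.6+ and Python 3:

```python
def decode_signed_utf8(values):
    # Map each signed byte (-128..127) to its unsigned value (0..255).
    # For example, -61 & 0xFF == 195 (0xC3) and -87 & 0xFF == 169 (0xA9).
    raw = bytearray(v & 0xFF for v in values)
    # Let the UTF-8 codec combine multi-byte sequences into characters.
    return raw.decode('utf-8')


data = [67, 97, 102, -61, -87, 32, 70, 108, 111, 114, 97]
text = decode_signed_utf8(data)
```

The result is a text object ('unicode' in Python 2, 'str' in Python 3), so the two bytes 0xC3 0xA9 count as the single character "é".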
Upvotes: 1
Reputation: 281262
Use the little-used array module to convert your input to a bytestring and then decode it with the UTF-8 codec:
import array
decoded = array.array('b', your_input).tostring().decode('utf-8')
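If the same code might also run on Python 3 later, note that tostring() was renamed to tobytes() there. A hedged sketch that picks whichever method is available (the helper name decode_with_array_module is hypothetical):

```python
import array


def decode_with_array_module(values):
    # Type code 'b' stores signed 8-bit integers, matching the input.
    arr = array.array('b', values)
    # Python 3 provides tobytes(); Python 2 only has tostring().
    raw = arr.tobytes() if hasattr(arr, 'tobytes') else arr.tostring()
    return raw.decode('utf-8')


decoded = decode_with_array_module(
    [67, 97, 102, -61, -87, 32, 70, 108, 111, 114, 97])
```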
Upvotes: 3
Reputation: 113998
You can use struct.pack for this:
>>> import struct
>>> a = [67, 97, 102, -61, -87, 32, 70, 108, 111, 114, 97]
>>> struct.pack("b"*len(a),*a)
'Caf\xc3\xa9 Flora'
>>> print struct.pack("b"*len(a),*a).decode('utf8')
Café Flora
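The format string can also use a repeat count rather than repeating "b" once per element. A small sketch along those lines, with decode_with_struct as a made-up helper name:

```python
import struct


def decode_with_struct(values):
    # '11b' means eleven signed chars; build the count from the input length.
    fmt = '%db' % len(values)
    raw = struct.pack(fmt, *values)
    return raw.decode('utf-8')


a = [67, 97, 102, -61, -87, 32, 70, 108, 111, 114, 97]
result = decode_with_struct(a)
```

Because the list is unpacked into positional arguments, this is probably best kept for short inputs.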
Upvotes: 2