Tim Petri

Reputation: 105

How can I decode a utf-8 byte array to a string in Python2?

I have an array of bytes representing a utf-8 encoded string. I want to decode these bytes back into the string in Python2. I am relying on Python2 for my overall program, so I cannot switch to Python3.

array = [67, 97, 102, -61, -87, 32, 70, 108, 111, 114, 97]

-> Café Flora

Since every character in the string I want is not necessarily represented by exactly 1 byte in the array, I can not use a solution like:

"".join(map(chr, array))

I tried to create a function that would step through the array, and whenever it encounters a number not in the range 0-127 (ASCII), create a new 16 bit int, shift the current bits over 8 to the left, and then add the following byte using a bitwise OR. Finally it would use unichr() to decode it.

result = []


for i in range(len(byte_array)):
    x = byte_array[i]
    if x < 0:
        b16 = x & 0xFFFF # 16 bit
        b16 = b16 << 8
        b16 = b16 | byte_array[i+1]
        result.append(unichr(m16))
    else:
        result.append(chr(x))

return "".join(result)

However, this was unsuccessful.

The following article explains the issue very well, and includes a nodeJS solution:

http://ixti.net/development/node.js/2011/10/26/get-utf-8-string-from-array-of-bytes-in-node-js.html

Upvotes: 5

Views: 8649

Answers (3)

jsbueno

Reputation: 110311

Keep in mind that a "string" in Python2 is not proper text, just a sequence of bytes in memory that happens to map to characters when you "print" it. If the encoding of the intended characters in the byte sequence matches the one the terminal uses, you will see properly formatted text.

If your terminal is not UTF-8, even if you get the proper byte string in memory, just printing it would show you the wrong results. That is why the extra "decode" step is needed at the end of the expression.

text = b''.join(chr(i if i >= 0 else 256 + i) for i in array).decode('utf-8')

As your source encoded the numbers between 128 and 255 as negative numbers, we have the inline "if" operator to renormalize the value before calling "chr".
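The same renormalization can be written with a bit mask instead of a conditional. This is only a minimal sketch, not part of the original expression: the names `signed` and `unsigned` are made up for illustration, and it relies on `bytearray`, which accepts integers in the range 0..255 directly and has a `decode` method in Python 2.6+ as well as Python 3.

```python
# Signed byte values as given in the question.
signed = [67, 97, 102, -61, -87, 32, 70, 108, 111, 114, 97]

# Masking with 0xFF maps -128..-1 onto 128..255 and leaves 0..127 unchanged.
unsigned = [b & 0xFF for b in signed]

# bytearray takes integers in 0..255, so no chr() call is needed.
text = bytearray(unsigned).decode('utf-8')
```

In Python 2 the result is a 'unicode' object, so printing it still depends on the terminal encoding as described above.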

Just to be clear: you say "Since every character in the string I want is not necessarily represented by exactly 1 byte in the array". With Python2.x strings, the only thing that takes care of that is the terminal. If you want to deal with proper text, join your numbers into a (byte) string and then call its "decode" method. That is the part that knows about UTF-8 multi-byte encoded characters and gives you back a (text) string object (a 'unicode' object in Python 2) that treats each character as an entity.
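To illustrate what "decode" does with a multi-byte character, here is a worked example for the two negative values in the question. It is a sketch that only covers the two-byte UTF-8 form (bit layout 110xxxxx 10xxxxxx), and the names `lead`, `cont` and `code_point` are chosen just for this illustration. It also shows why shifting the first byte left by 8 and OR-ing in the second does not give the character: the payload bits have to be extracted first.

```python
# The two negative values from the question, renormalized to 0..255.
lead, cont = -61 & 0xFF, -87 & 0xFF   # 0xC3, 0xA9

# A two-byte UTF-8 sequence looks like 110xxxxx 10xxxxxx.
assert lead >> 5 == 0b110   # lead byte marker
assert cont >> 6 == 0b10    # continuation byte marker

# Keep the 5 payload bits of the lead byte and the 6 of the continuation byte.
code_point = ((lead & 0x1F) << 6) | (cont & 0x3F)

# Naive concatenation of the raw bytes, as attempted in the question.
naive = (lead << 8) | cont

# The codec arrives at the same code point as the manual arithmetic.
decoded = bytearray([lead, cont]).decode('utf-8')
```

Here `code_point` is 0xE9, the code point of "é", while `naive` is 0xC3A9, which is a different character entirely.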

Upvotes: 1

user2357112

Reputation: 281262

Use the little-used array module to convert your input to a bytestring and then decode it with the UTF-8 codec:

import array
decoded = array.array('b', your_input).tostring().decode('utf-8')
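If the same code may also have to run on a newer interpreter, note that `tostring()` is the Python 2 name and that Python 3 calls the method `tobytes()`. The wrapper below is a hedged sketch under that assumption; `decode_signed_bytes` is a hypothetical helper name, not something the array module provides.

```python
import array

def decode_signed_bytes(values, encoding='utf-8'):
    """Decode a list of signed byte values (-128..127) into text."""
    # Typecode 'b' means signed char, so negative values are accepted as-is.
    arr = array.array('b', values)
    # Prefer tobytes() where it exists; fall back to Python 2's tostring().
    raw = arr.tobytes() if hasattr(arr, 'tobytes') else arr.tostring()
    return raw.decode(encoding)

decoded = decode_signed_bytes([67, 97, 102, -61, -87, 32, 70, 108, 111, 114, 97])
```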

Upvotes: 3

Joran Beasley

Reputation: 113998

You can use struct.pack for this:

>>> import struct
>>> a = [67, 97, 102, -61, -87, 32, 70, 108, 111, 114, 97]
>>> struct.pack("b"*len(a),*a)
'Caf\xc3\xa9 Flora'
>>> print struct.pack("b"*len(a),*a).decode('utf8')
Café Flora
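A small variation, offered as a sketch rather than a correction: struct format strings accept a repeat count, so "11b" means the same as eleven "b" characters. The names `fmt`, `raw` and `round_trip` are only for illustration. Unpacking with the same format gives the original signed values back, which is a quick way to check the conversion.

```python
import struct

a = [67, 97, 102, -61, -87, 32, 70, 108, 111, 114, 97]

# Repeat-count form of the format string, e.g. "11b" for 11 signed bytes.
fmt = "%db" % len(a)
raw = struct.pack(fmt, *a)

# Decode the packed bytes as UTF-8 text.
text = raw.decode('utf8')

# Unpacking with the same format recovers the signed values.
round_trip = list(struct.unpack(fmt, raw))
```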

Upvotes: 2
