python decode partial utf-8 byte array

Question

I'm receiving data from a channel that is not aware of UTF-8 rules. Since UTF-8 can use multiple bytes to encode one character, I sometimes get an error when I try to convert part of the received data into text, because a chunk ends in the middle of a character. By the nature of the interface (a stream without any end) I cannot find out when the data is complete, so I need to handle partial UTF-8 decoding. Basically, I need to decode what I can and store the incomplete remainder, which is then prepended to the next chunk of data. Is there some neat function in Python that allows this?

[EDIT] To be clear, I know about this function from the Python docs:

 bytes.decode(encoding="utf-8", errors="ignore")

but the problem is that it does not tell me where the error is, so I cannot know how many bytes from the end I should keep.

Serge Ballesta · Accepted Answer

You can call the codecs module to the rescue. It directly gives you an incremental decoder that does exactly what you need:

import codecs

dec = codecs.getincrementaldecoder('utf8')()

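You can feed it with dec.decode(input); the decoder returns whatever text is complete and keeps any trailing incomplete bytes internally until the next call. When the stream is over, you can optionally call dec.decode(bytes(), True) to force it to clean up any stored state.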
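As a minimal sketch of that feed-and-flush pattern, the generator below decodes chunks that are split in the middle of multi-byte characters. The name decode_chunks and the sample chunks are made up for illustration and are not part of the codecs module.

```python
import codecs

def decode_chunks(chunks):
    """Hypothetical helper: yield text as soon as complete characters arrive."""
    dec = codecs.getincrementaldecoder("utf-8")()
    for chunk in chunks:
        # Incomplete trailing bytes stay buffered inside the decoder.
        text = dec.decode(chunk)
        if text:
            yield text
    # final=True raises UnicodeDecodeError if a character was left unfinished.
    tail = dec.decode(b"", True)
    if tail:
        yield tail

# "Žluť" split in the middle of both two-byte characters
chunks = [b"\xc5", b"\xbdlu\xc5", b"\xa5"]
pieces = list(decode_chunks(chunks))
print(pieces)
```

For a stream that never ends, the final flush would simply never be reached; the loop body alone is enough.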

The test becomes:

>>> import codecs, sys
>>> def test(arr):
...     dec = codecs.getincrementaldecoder('utf8')()
...     recvString = ""
...     for i in range(len(arr)):
...         recvString += dec.decode(arr[i:i+1])  # feed one byte at a time
...         sys.stdout.write("%02d : %s\n" % (i, recvString))
...     recvString += dec.decode(bytes(), True)  # final=True: will choke on incomplete input
...     return recvString == arr.decode('utf8')

>>> testUtf8 = bytes([0x61, 0xc5, 0xbd, 0x6c, 0x75, 0xc5, 0xa5, 0x6f, 0x75, 0xc4, 0x8d, 0x6b, 0xc3, 0xbd, 0x20, 0x6b, 0xc5, 0xaf, 0xc5, 0x88])
>>> test(testUtf8)
00 : a
01 : a
02 : aŽ
03 : aŽl
04 : aŽlu
05 : aŽlu
06 : aŽluť
07 : aŽluťo
08 : aŽluťou
09 : aŽluťou
10 : aŽluťouč
11 : aŽluťoučk
12 : aŽluťoučk
13 : aŽluťoučký
14 : aŽluťoučký 
15 : aŽluťoučký k
16 : aŽluťoučký k
17 : aŽluťoučký ků
18 : aŽluťoučký ků
19 : aŽluťoučký kůň
True
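If the stream is cut off in the middle of a character, the final call is where that shows up. The following sketch assumes CPython's UTF-8 incremental decoder, where getstate() returns the still-buffered bytes as the first element of its tuple; the sample input is made up for illustration.

```python
import codecs

dec = codecs.getincrementaldecoder("utf-8")()

# The last byte starts a two-byte character that never gets finished.
text = dec.decode(b"k\xc5\xaf\xc5")

# Peek at the bytes the decoder is holding back (assumption: CPython behaviour).
pending, _flags = dec.getstate()

try:
    dec.decode(b"", True)  # final=True: no more input will follow
    truncated = False
except UnicodeDecodeError:
    truncated = True

print(repr(text), pending, truncated)
```

On an endless stream you would normally skip the final call and just keep feeding chunks.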
