Vit Bernatik

Reputation: 3802

python decode partial utf-8 byte array

I'm getting data from a channel that is not aware of UTF-8 rules. When UTF-8 uses multiple bytes to encode one character and I try to convert only part of the received data to text, the conversion fails with an error. Because of the nature of the interface (a stream without any end), I cannot tell when the data is complete. So I need to handle partial UTF-8 decoding: decode what I can and store the incomplete remainder, which is then prepended to the next chunk of data. Is there a neat function in Python that allows this?

[EDIT] To be clear, I know about this function from the Python docs:

 bytes.decode(encoding="utf-8", errors="ignore")

but the issue is that it does not tell me where the error is, so I cannot know how many bytes from the end I should keep.

Upvotes: 10

Views: 3361

Answers (2)

Serge Ballesta

Reputation: 148975

You can call the codecs module to the rescue. It directly gives you an incremental decoder, which does exactly what you need:

import codecs

dec = codecs.getincrementaldecoder('utf8')()

You feed it with dec.decode(input), and when the input is over you can optionally call dec.decode(bytes(), True) to force it to flush any stored state.
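As a small illustration of those two calls, here is a minimal sketch. The variable names are only for illustration, and the use of getstate() to peek at the held-back bytes assumes the standard CPython UTF-8 decoder, whose state tuple starts with the buffered bytes.

```python
import codecs

dec = codecs.getincrementaldecoder("utf-8")()

# 0xc5 starts a two-byte character, so only "a" comes out for now.
first = dec.decode(b"a\xc5")
# The incomplete byte is kept inside the decoder, not lost.
pending = dec.getstate()[0]
# The next chunk completes the character.
second = dec.decode(b"\xbd")

# If the stream really ends in the middle of a character,
# the final call raises instead of silently dropping the bytes.
dec.decode(b"\xc5")
try:
    dec.decode(b"", final=True)
    truncated = False
except UnicodeDecodeError:
    truncated = True

print(repr(first), pending, repr(second), truncated)
```

So the "store partial data and prefix it to the next data" bookkeeping is done by the decoder object itself.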

With it, the test becomes:

>>> import codecs, sys
>>> def test(arr):
...     dec = codecs.getincrementaldecoder('utf8')()
...     recvString = ""
...     # feed the decoder one byte at a time
...     for i in range(len(arr)):
...         recvString += dec.decode(arr[i:i+1])
...         sys.stdout.write("%02d : %s\n" % (i, recvString))
...     recvString += dec.decode(bytes(), True) # will choke on incomplete input...
...     return recvString == arr.decode('utf8')
...

>>> testUtf8 = bytes([0x61, 0xc5, 0xbd, 0x6c, 0x75, 0xc5, 0xa5, 0x6f, 0x75, 0xc4, 0x8d, 0x6b, 0xc3, 0xbd, 0x20, 0x6b, 0xc5, 0xaf, 0xc5, 0x88])
>>> test(testUtf8)
00 : a
01 : a
02 : aŽ
03 : aŽl
04 : aŽlu
05 : aŽlu
06 : aŽluť
07 : aŽluťo
08 : aŽluťou
09 : aŽluťou
10 : aŽluťouč
11 : aŽluťoučk
12 : aŽluťoučk
13 : aŽluťoučký
14 : aŽluťoučký 
15 : aŽluťoučký k
16 : aŽluťoučký k
17 : aŽluťoučký ků
18 : aŽluťoučký ků
19 : aŽluťoučký kůň
True
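For an endless stream, the same decoder can be wrapped in a generator. This is only a sketch: decode_stream and the sample chunks are hypothetical names and data, not part of the codecs module.

```python
import codecs

def decode_stream(chunks, encoding="utf-8"):
    """Yield decoded text chunk by chunk, holding back incomplete bytes."""
    dec = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        text = dec.decode(chunk)
        if text:  # a chunk may contain only part of a character
            yield text
    # Flush at end of input; raises UnicodeDecodeError if the
    # stream stopped in the middle of a character.
    tail = dec.decode(b"", final=True)
    if tail:
        yield tail

# Hypothetical chunks that split two-byte characters across boundaries.
chunks = [b"a\xc5", b"\xbdlu", b"\xc5", b"\xa5"]
pieces = list(decode_stream(chunks))
print(pieces)
```

If the stream never ends, the final flush is simply never reached, and each chunk still yields everything that can be decoded so far.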

Upvotes: 17

Vit Bernatik

Reputation: 3802

So far I have come up with this not-so-nice function:

def decodeBytesUtf8Safe(toDec):
    """
    Decodes a UTF-8 byte array to a string. It can handle the case where the end of
    the byte array is incomplete and would cause a UTF-8 error. In that case the text
    is decoded only up to the error. The rest of the byte array (from the error to the
    end) is returned as the second value and can be combined with the next byte array
    and decoded next time.
    :param toDec: byte array to be decoded (e.g. bytes("abc", "utf8"))
    :return:
     1. decoded string
     2. rest of the byte array which could not be decoded due to the error
    """
    okLen = len(toDec)
    outStr = ""
    # Shrink the slice from the end until it decodes cleanly.
    while okLen > 0:
        try:
            outStr = toDec[:okLen].decode("utf-8")
        except UnicodeDecodeError:
            okLen -= 1
        else:
            break
    return outStr, toDec[okLen:]

You can test it with this script:

import sys

def test(arr):
    expStr = arr.decode("utf-8")
    errorCnt = 0
    # split the input at every possible position and decode it in two steps
    for i in range(len(arr) + 1):
        decodedTxt, rest = decodeBytesUtf8Safe(arr[0:i])
        decodedTxt2, rest2 = decodeBytesUtf8Safe(rest + arr[i:])
        recvString = decodedTxt + decodedTxt2
        sys.stdout.write("%02d ; %s (%s - %s )\n" % (i, recvString, decodedTxt, decodedTxt2))
        if expStr != recvString:
            print("Error when divided at %i" % (i))
            errorCnt += 1
    return errorCnt

testUtf8 = bytes([0x61, 0xc5, 0xbd, 0x6c, 0x75, 0xc5, 0xa5, 0x6f, 0x75, 0xc4, 0x8d, 0x6b, 0xc3, 0xbd, 0x20, 0x6b, 0xc5, 0xaf, 0xc5, 0x88])
err = test(testUtf8)
print("total errors %i" % (err))

It should give you this output:

00 ; aŽluťoučký kůň ( - aŽluťoučký kůň )
01 ; aŽluťoučký kůň (a - Žluťoučký kůň )
02 ; aŽluťoučký kůň (a - Žluťoučký kůň )
03 ; aŽluťoučký kůň (aŽ - luťoučký kůň )
04 ; aŽluťoučký kůň (aŽl - uťoučký kůň )
05 ; aŽluťoučký kůň (aŽlu - ťoučký kůň )
06 ; aŽluťoučký kůň (aŽlu - ťoučký kůň )
07 ; aŽluťoučký kůň (aŽluť - oučký kůň )
08 ; aŽluťoučký kůň (aŽluťo - učký kůň )
09 ; aŽluťoučký kůň (aŽluťou - čký kůň )
10 ; aŽluťoučký kůň (aŽluťou - čký kůň )
11 ; aŽluťoučký kůň (aŽluťouč - ký kůň )
12 ; aŽluťoučký kůň (aŽluťoučk - ý kůň )
13 ; aŽluťoučký kůň (aŽluťoučk - ý kůň )
14 ; aŽluťoučký kůň (aŽluťoučký -  kůň )
15 ; aŽluťoučký kůň (aŽluťoučký  - kůň )
16 ; aŽluťoučký kůň (aŽluťoučký k - ůň )
17 ; aŽluťoučký kůň (aŽluťoučký k - ůň )
18 ; aŽluťoučký kůň (aŽluťoučký ků - ň )
19 ; aŽluťoučký kůň (aŽluťoučký ků - ň )
20 ; aŽluťoučký kůň (aŽluťoučký kůň -  )
total errors 0
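A possible shortcut for the retry loop, sketched under the assumption that the only problem is a character cut off at the end: the UnicodeDecodeError exception carries a start attribute with the index of the first byte that could not be decoded, so the split point can be read from it directly. decode_utf8_prefix is a hypothetical name used only for this illustration.

```python
def decode_utf8_prefix(data):
    """Return (decoded text, undecoded rest) for a UTF-8 byte string."""
    try:
        return data.decode("utf-8"), b""
    except UnicodeDecodeError as ex:
        # ex.start is the index of the first byte that failed to decode;
        # everything before it is valid UTF-8.
        return data[:ex.start].decode("utf-8"), data[ex.start:]

text, rest = decode_utf8_prefix(b"a\xc5")
print(repr(text), rest)
```

Like the loop version, this keeps everything from the first bad byte onward, so a byte that is invalid rather than merely incomplete would be carried along in the rest forever.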

Upvotes: 1
