How to permissively decode a UTF-8 bytearray?

Question

I need to decode a UTF-8 sequence, which is stored in a bytearray, to a string.

The UTF-8 sequence might contain erroneous parts. In this case I need to decode as much as possible and (optionally?) substitute invalid parts by something like "?".

# First part decodes to "ABÄC"
b = bytearray([0x41, 0x42, 0xC3, 0x84, 0x43])
s = str(b, "utf-8") 
print(s)

# Second part, invalid sequence, wanted to decode to something like "AB?C"
b = bytearray([0x41, 0x42, 0xC3, 0x43])
s = str(b, "utf-8")
print(s)

What's the best way to achieve this in Python 3?

schesis · Accepted Answer

There are several builtin error-handling schemes for encoding and decoding str to and from bytes and bytearray with e.g. bytearray.decode(). For example:

>>> b = bytearray([0x41, 0x42, 0xC3, 0x43])

>>> b.decode('utf8', errors='ignore')  # discard malformed bytes
'ABC'

>>> b.decode('utf8', errors='replace')  # replace with U+FFFD
'AB�C'

>>> b.decode('utf8', errors='backslashreplace')  # replace with backslash-escape
'AB\xc3C'

In addition, you can write your own error handler and register it:

import codecs

def my_handler(exception):
    """Replace unexpected bytes with '?'."""
    return '?', exception.end

codecs.register_error('my_handler', my_handler)

>>> b.decode('utf8', errors='my_handler')
'AB?C'

All of these error handling schemes can also be used with the str() constructor as in your question:

>>> str(b, 'utf8', errors='my_handler')
'AB?C'

... although it's more idiomatic to use str.decode() explicitly.

How to permissively decode a UTF-8 bytearray?

Answers (1)

Related Questions