Iguananaut

Reputation: 23346

Validate that a stream of bytes is valid UTF-8 (or other encoding) without copying

This is perhaps a micro-optimization, but I would like to check that a stream of bytes is valid UTF-8 as it passes through my application, without keeping the resulting decoded code points. In other words, if I were to call large_string.decode('utf-8'), then assuming the decoding succeeds I have no desire to keep the unicode string it returns, and would prefer not to waste memory on it.

There are various ways I could do this: for example, read a few bytes at a time, attempt to decode() them, and append more bytes until decode() succeeds (or I've exhausted the maximum number of bytes for a single character in the encoding). But it seems to me it should be possible to use the existing decoder in a way that simply throws away the decoded unicode characters, without having to roll my own. Nothing immediately comes to mind from scouring the stdlib docs, though.

Upvotes: 1

Views: 1891

Answers (1)

Martijn Pieters

Reputation: 1123350

You can use the incremental decoder provided by the codecs module:

import codecs
utf8_decoder = codecs.getincrementaldecoder('utf8')()

This is an IncrementalDecoder instance. You can then feed this decoder data in order and validate the stream:

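try:
    # chunks: the stream's bytes objects, in order
    for chunk in chunks:
        utf8_decoder.decode(chunk)  # decoded text is discarded
    # final=True rejects a stream that ends mid-character
    utf8_decoder.decode(b'', final=True)
except UnicodeDecodeError:
    raise  # invalid data: handle or re-raise as appropriate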

The decoder returns the data decoded so far, minus any partial multi-byte sequence at the end, which is kept as state for the next chunk you decode. Those smaller strings are cheap to create and discard; you are never creating one large string.
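To see that buffering in action, here is a small sketch that splits a two-byte character across two chunks. The sample bytes and variable names are made up for illustration. The last call passes final=True, which tells the decoder that no more data is coming, so any bytes still pending are treated as an error.

```python
import codecs

decoder = codecs.getincrementaldecoder('utf8')()

# b'\xc3\xa9' is the UTF-8 encoding of U+00E9, split here across two chunks.
first = decoder.decode(b'caf\xc3')   # the lone lead byte is held back as state
second = decoder.decode(b'\xa9')     # the sequence completes, one character out

# A stream that stops in the middle of a sequence only fails once the
# decoder is told that no more data is coming.
truncated = codecs.getincrementaldecoder('utf8')()
truncated.decode(b'\xc3')            # no error yet, just pending state
try:
    truncated.decode(b'', final=True)
    ended_cleanly = True
except UnicodeDecodeError:
    ended_cleanly = False
```

Without that closing final=True call, a stream cut off halfway through a character would pass validation unnoticed.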

You can't start the above loop in the middle of a stream, though: UTF-8 uses a variable number of bytes per code point, so data that begins partway through is liable to start with the tail end of a multi-byte sequence, which the decoder rejects as invalid.

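If you can't validate from the start, then your first chunk may start with up to three continuation bytes (bytes of the form 10xxxxxx). You could just remove those first; on Python 3, indexing a bytes object gives an integer, so the bit test works directly: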

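first_chunk = b'....'
for _ in range(3):
    # stop once the chunk is empty or no longer starts with a continuation byte
    if not first_chunk or first_chunk[0] & 0xc0 != 0x80:
        break
    # remove continuation byte
    first_chunk = first_chunk[1:]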
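As a rough illustration of why the stripping matters, the sketch below feeds a chunk that begins mid-character to the decoder, first raw and then after trimming. The helper name strip_leading_continuation, its max_bytes parameter, and the sample bytes are all hypothetical.

```python
import codecs


def strip_leading_continuation(chunk, max_bytes=3):
    """Drop up to max_bytes UTF-8 continuation bytes (0b10xxxxxx) from the front."""
    skip = 0
    while skip < min(max_bytes, len(chunk)) and chunk[skip] & 0xC0 == 0x80:
        skip += 1
    return chunk[skip:]


# Sample data: the tail of a stream that was joined in the middle of the
# two-byte sequence b'\xc3\xa9'.
mid_stream = b'\xa9 au lait'

decoder = codecs.getincrementaldecoder('utf8')()
try:
    decoder.decode(mid_stream)
    raw_ok = True
except UnicodeDecodeError:
    raw_ok = False  # the stray continuation byte is rejected

decoder.reset()  # discard any state left over from the failed attempt
cleaned = strip_leading_continuation(mid_stream)
text = decoder.decode(cleaned, final=True)
```

Note that the character whose tail was trimmed is never checked, so this only validates the stream from the first complete character onward.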

Now, UTF-8 is structured enough that you could also validate the stream entirely in Python code using more such bit tests, but you are simply not going to match the speed of the built-in decoder.

Upvotes: 6
