arcyqwerty

Reputation: 10685

Read utf-8 character from byte stream

Given a stream of bytes (generator, file, etc.) how can I read a single utf-8 encoded character?

I could approach this by rolling my own utf-8 decoding function but I would prefer not to reinvent the wheel since I'm sure this functionality must already be used elsewhere to parse utf-8 strings.

Upvotes: 5

Views: 1075

Answers (1)

Kevin

Reputation: 30151

Wrap the stream in an io.TextIOWrapper with encoding='utf-8', then call .read(1) on it. In text mode, read(1) returns one decoded character rather than one byte.

This is assuming you started with a BufferedIOBase or something duck-type compatible with it (i.e. has a read() method). If you have a generator or iterator, you may need to adapt the interface.

Example:

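from io import TextIOWrapper

with open('/path/to/file', 'rb') as f:
    wf = TextIOWrapper(f, encoding='utf-8')
    # _CHUNK_SIZE is an implementation detail and may not work everywhere.
    # Setting it to 1 is meant to stop the wrapper from buffering bytes
    # past the character it decodes.
    wf._CHUNK_SIZE = 1

    char = wf.read(1)  # next UTF-8 decoded character (a str of length 1)
    byte = f.read(1)   # next raw byte from the underlying stream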
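
For the generator or iterator case, one alternative that avoids adapting the interface is the standard library's incremental decoder (codecs.getincrementaldecoder). This is a different technique from the TextIOWrapper approach above, and only a minimal sketch: it assumes the source yields bytes objects, and iter_utf8_chars and byte_gen are hypothetical names made up for illustration.

```python
import codecs
from typing import Iterable, Iterator


def iter_utf8_chars(byte_chunks: Iterable[bytes]) -> Iterator[str]:
    """Lazily yield decoded characters from an iterable of bytes chunks."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in byte_chunks:
        # decode() returns "" while a multi-byte sequence is still incomplete,
        # so nothing is yielded until a whole character has arrived.
        yield from decoder.decode(chunk)
    # final=True raises UnicodeDecodeError if the stream ended mid-character.
    yield from decoder.decode(b"", final=True)


def byte_gen() -> Iterator[bytes]:
    # Hypothetical source (assumption): yields the stream one byte at a time.
    for b in "a\u00e9\u20ac".encode("utf-8"):
        yield bytes([b])


chars = iter_utf8_chars(byte_gen())
first = next(chars)   # one character, decoded from 1 byte
second = next(chars)  # one character, decoded from 2 bytes
```

Because iter_utf8_chars is itself a generator, each next() call pulls only as many chunks from the source as it needs to complete one character. With single-byte chunks, that means no bytes are consumed beyond the character returned.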

Upvotes: 2
