encodings.utf_8.StreamReader readline(), read() and seek() don't cooperate

Question

Consider this very simple example.

import codecs
from io import BytesIO

string = b"""# test comment
Some line without comment
# another comment
"""

reader = codecs.getreader("UTF-8")
stream = reader(BytesIO(string))

lines = []
while True:
    # get current position
    position = stream.tell()

    # read first character
    char = stream.read(1)

    # return cursor to start
    stream.seek(position, 0)

    # end of stream
    if char == "":
        break

    # line is not comment
    if char != "#":
        lines.append(stream.readline())
        continue

    # line is comment. Skip it.
    stream.readline()

print(lines)
assert lines == ["Some line without comment\n"]

I am trying to read from a StreamReader line by line: if a line starts with #, I skip it; otherwise I store it in a list. But I see some strange behaviour when I use the seek() method. It seems that seek() and readline() don't cooperate and move the cursor somewhere far away. The resulting list is empty.

Of course I could do this a different way. But as I wrote above, this is a very simple example, and it helps me understand how things work together.

I use Python 3.5.

Martijn Pieters · Accepted Answer

You don't want to use codecs stream readers. They are an older, outdated attempt at implementing layered I/O to handle encoding and decoding of text, since superseded by the io module, a much more robust and faster implementation. There have been serious calls for the stream readers to be deprecated.
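To illustrate why the loop in the question goes wrong, here is a minimal sketch. It rests on an assumption about how CPython's codecs.StreamReader behaves rather than on a documented guarantee: tell() appears to be passed straight through to the underlying byte stream, while readline() reads ahead in chunks and buffers the decoded text. The reported position can then be well past the line that was just returned, and seeking back to it discards whatever was buffered.

```python
import codecs
from io import BytesIO

data = b"# test comment\nSome line without comment\n# another comment\n"
stream = codecs.getreader("utf-8")(BytesIO(data))

first = stream.readline()

# Position as reported through the reader. It comes from the BytesIO object,
# which has already been read ahead of the line that was returned.
reported = stream.tell()

# Position one might expect instead: just past the first line.
expected = len(first.encode("utf-8"))

print(repr(first), reported, expected)
```

If the two numbers differ, the position saved in the question's loop is not the start of the next line, which would explain the "cursor far away" effect.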

You really want to replace your use of codecs.getreader() with the io.TextIOWrapper() object:

from io import BytesIO, TextIOWrapper

string = b"""# test comment
Some line without comment
# another comment
"""

# pass the encoding explicitly; the default depends on the locale
stream = TextIOWrapper(BytesIO(string), encoding="utf-8")

at which point the while loop works and lines ends up as ['Some line without comment\n'].
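For completeness, this is roughly what the question's loop looks like once it runs against a TextIOWrapper. It is a sketch, not the only way to write it; the variable name data and the explicit encoding="utf-8" argument are my own choices here.

```python
from io import BytesIO, TextIOWrapper

data = b"# test comment\nSome line without comment\n# another comment\n"
stream = TextIOWrapper(BytesIO(data), encoding="utf-8")

lines = []
while True:
    position = stream.tell()   # remember where this line starts
    char = stream.read(1)      # peek at the first character
    stream.seek(position)      # rewind to the start of the line

    if char == "":             # end of stream
        break

    line = stream.readline()   # consume the whole line either way
    if char != "#":            # keep only non-comment lines
        lines.append(line)

print(lines)
```

Here tell() and seek() belong to the same object that does the buffering, so a saved position can be handed back to seek() safely.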

You also don't need to use seeking or tell() here. You could just loop directly over a file object (including a TextIOWrapper() object):

lines = []
for line in stream:
    if not line.startswith('#'):
        lines.append(line)

or even:

lines = [l for l in stream if not l.startswith('#')]

If you are concerned about the TextIOWrapper() wrapper object closing the underlying stream when you no longer need the wrapper, just detach the wrapper first:

stream.detach()
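A short sketch of that pattern, with raw and underlying as illustrative names: detach() hands back the binary stream that was wrapped, and that stream stays open for further use, while the wrapper itself can no longer be used.

```python
from io import BytesIO, TextIOWrapper

raw = BytesIO(b"# test comment\nSome line without comment\n")
wrapper = TextIOWrapper(raw, encoding="utf-8")
lines = [l for l in wrapper if not l.startswith("#")]

# detach() returns the underlying binary stream; the wrapper is unusable afterwards.
underlying = wrapper.detach()

# The BytesIO object is still open and can be rewound and read again.
underlying.seek(0)
first_byte = underlying.read(1)
print(underlying.closed, first_byte)
```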
