kolurbo

Reputation: 538

encodings.utf_8.StreamReader readline(), read() and seek() don't cooperate

Consider this very simple example.

import codecs
from io import BytesIO

string = b"""# test comment
Some line without comment
# another comment
"""

reader = codecs.getreader("UTF-8")
stream = reader(BytesIO(string))

lines = []
while True:
    # get current position
    position = stream.tell()

    # read first character
    char = stream.read(1)

    # return cursor to start
    stream.seek(position, 0)

    # end of stream
    if char == "":
        break

    # line is not comment
    if char != "#":
        lines.append(stream.readline())
        continue

    # line is comment. Skip it.
    stream.readline()

print(lines)
assert lines == ["Some line without comment\n"]

I am trying to read from a StreamReader line by line: if a line starts with #, I skip it; otherwise I store it in a list. But something strange happens when I use the seek() method. It seems that seek() and readline() don't cooperate and move the cursor somewhere far away. The resulting list is empty.

Of course I could do this a different way. But as I wrote above, this is a very simple example, and it helps me understand how things work together.

I use Python 3.5.

Upvotes: 1

Views: 2369

Answers (2)

ElToro1966

Reputation: 901

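Your loop will work if you read from the BytesIO object directly. Since BytesIO returns bytes rather than str, the comparisons must then use b"" and b"#", and the collected lines are bytes. To do that, swap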

reader = codecs.getreader("UTF-8")
stream = reader(BytesIO(string))

with

stream = BytesIO(string)

EDIT: If you want to keep using StreamReader, you can drop the repositioning with tell() and seek(), because stream.read() and stream.readline() already advance the position on their own. In other words, your current code repositions twice.

The changed code in the loop:

    # read first character
    char = stream.read(1)

    # end of stream
    if char == "":
        break

    # line is not comment
    if char != "#":
        lines.append(char + stream.readline())
        continue

    # line is comment. Skip it.
    stream.readline()

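Note the change to lines.append(): the character already consumed by read(1) is prepended to the rest of the line returned by readline().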
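For reference, here is the whole script with that loop in place, as a minimal sketch. It assumes a recent Python 3, where StreamReader.read(1) returns a single character; the helper name read_non_comment_lines is made up for illustration.

```python
import codecs
from io import BytesIO

data = b"""# test comment
Some line without comment
# another comment
"""


def read_non_comment_lines(stream):
    """Collect lines that do not start with '#', without seek() or tell()."""
    lines = []
    while True:
        # read(1) consumes the first character of the next line
        char = stream.read(1)

        # an empty string signals end of stream
        if char == "":
            break

        if char != "#":
            # prepend the consumed character; readline() returns the rest
            lines.append(char + stream.readline())
        else:
            # comment line: read and discard the rest of it
            stream.readline()
    return lines


stream = codecs.getreader("UTF-8")(BytesIO(data))
lines = read_non_comment_lines(stream)
print(lines)
```

One caveat with this approach: a blank line would be merged with the line that follows it, because read(1) consumes the newline itself. The sample data has no blank lines, so it does not show up here.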

Upvotes: 1

Martijn Pieters

Reputation: 1121814

You don't want to use the codecs stream readers. They are an older, outdated attempt at implementing layered I/O to handle encoding and decoding of text, since superseded by the io module, which is a much more robust and faster implementation. There have been serious calls for the stream readers to be deprecated.
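The failure in the question seems to come from exactly this layering. As far as I can tell from CPython's codecs module, StreamReader does not define tell() itself and passes it through to the underlying byte stream, while readline() reads ahead in chunks and buffers the decoded text. A small sketch of the mismatch (the variable names are mine, and the read-ahead behaviour is an implementation detail, not a documented guarantee):

```python
import codecs
from io import BytesIO

data = b"# test comment\nSome line without comment\n# another comment\n"

stream = codecs.getreader("UTF-8")(BytesIO(data))
first_line = stream.readline()

# Bytes the caller has logically consumed so far
logical_position = len(first_line.encode("UTF-8"))
# Position of the underlying BytesIO, which readline() has read ahead on
reported_position = stream.tell()

print(repr(first_line), logical_position, reported_position)
```

Here reported_position is larger than logical_position, so seeking back to a value obtained from tell() lands past the text still held in the reader's buffer, and seek() then throws that buffer away.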

You really want to replace your use of codecs.getreader() with the io.TextIOWrapper() object:

from io import BytesIO, TextIOWrapper

string = b"""# test comment
Some line without comment
# another comment
"""

stream = TextIOWrapper(BytesIO(string), encoding="utf-8")

at which point the while loop works and lines ends up as ['Some line without comment\n'].
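Put together as a complete script, this might look as follows. It is a sketch rather than a drop-in replacement: the loop is condensed slightly from the question, and the encoding is passed explicitly because TextIOWrapper otherwise falls back to the locale default.

```python
from io import BytesIO, TextIOWrapper

data = b"""# test comment
Some line without comment
# another comment
"""

# explicit encoding; without it the locale's preferred encoding is used
stream = TextIOWrapper(BytesIO(data), encoding="utf-8")

lines = []
while True:
    # remember where the line starts
    position = stream.tell()

    # peek at the first character, then rewind
    char = stream.read(1)
    stream.seek(position)

    # end of stream
    if char == "":
        break

    # read the full line and keep it unless it is a comment
    line = stream.readline()
    if char != "#":
        lines.append(line)

print(lines)
```

Note that the values returned by tell() on a text stream are opaque; they are only meant to be passed back to seek(), which is all this loop does with them.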

You also don't need to use seeking or tell() here. You could just loop directly over a file object (including a TextIOWrapper() object):

lines = []
for line in stream:
    if not line.startswith('#'):
        lines.append(line)

or even:

lines = [l for l in stream if not l.startswith('#')]
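A self-contained version of the iteration approach, as a sketch. The sample data is extended with a blank line and a trailing line (my additions, not from the question) to show that iterating line by line leaves blank lines intact.

```python
from io import BytesIO, TextIOWrapper

data = b"""# test comment
Some line without comment

# another comment
last line
"""

# the with block closes the wrapper (and the BytesIO beneath it) when done
with TextIOWrapper(BytesIO(data), encoding="utf-8") as stream:
    lines = [line for line in stream if not line.startswith("#")]

print(lines)
```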

If you are concerned about the TextIOWrapper() wrapper object closing the underlying stream when you no longer need the wrapper, just detach the wrapper first:

stream.detach()
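To illustrate the difference, here is a short sketch contrasting a detached wrapper with one that is closed normally. The names raw, wrapper, detached and other_raw are only for this example.

```python
from io import BytesIO, TextIOWrapper

raw = BytesIO(b"# test comment\nSome line without comment\n")

wrapper = TextIOWrapper(raw, encoding="utf-8")
lines = [line for line in wrapper if not line.startswith("#")]

# detach() hands back the underlying binary stream and leaves it open;
# the wrapper itself is unusable afterwards
detached = wrapper.detach()
del wrapper

print(detached is raw, raw.closed)

# For contrast: closing a wrapper that was not detached closes the raw stream too
other_raw = BytesIO(b"data\n")
TextIOWrapper(other_raw, encoding="utf-8").close()
print(other_raw.closed)
```

After detach(), raw is still open but its position is wherever the wrapper left it, so a seek(0) may be needed before reading it again.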

Upvotes: 6
