sashoalm
sashoalm

Reputation: 79615

Read file line by line, but in reverse (last line first, then next to last, etc.)

I want to remove the trailing blank lines from a file (if it has any). Currently I do it by reading it in memory, removing the blank lines there, and overwriting it. The file is big, however (30000+ lines, and long lines), and this takes 2-3 seconds.

So I want to read the file line-by-line, but backwards, until I get to the first non-blank line. That is, I start with the last line, then the next to last, etc. I will just truncate it then, instead of overwriting it.

What would be the best way read it in reverse? Right now I'm thinking of reading blocks of 64k, and then looping through the string in reverse, char-by-char, until I get a line, and then when I run out of the 64k, to read another 64k and prepend them, and so on.

I assume there are no standard functions or libraries that read in reverse order?

Upvotes: 0

Views: 683

Answers (2)

Joe McMahon
Joe McMahon

Reputation: 3382

This is a modified version of some code I found elsewhere (probably here on StackOverflow, actually...) - I've extracted the two critical methods that handle reading backwards.

The reversed_blocks iterator reads backwards through the file in chunks of the size you prefer, and the reversed_lines iterator breaks the blocks into lines, saving the first one; if the next block ends in a newline, it returns this as a complete line, and if it doesn't, it appends the saved partial line to the last line of the new block, completing the line that was split over the block boundary.

All of the state is maintained by Python's iterator mechanisms, so we don't have to store the state globally anywhere; this also means you can read multiple files backward at once if you needed to, as the state is tied to the iterator.

def reversed_lines(self, file):
    "Generate the lines of file in reverse order."
    newline_char_set = set(['\r', '\n'])
    tail = ""
    for block in self.reversed_blocks(file):
        if block is not None and len(block)>0:
            # First split the whole block into lines and reverse the list
            reversed_lines = block.splitlines()
            reversed_lines.reverse()

            # If the last char of the block is not a newline, then the last line
            # crosses a block boundary, and the tail (possible partial line from
            # the previous block) should be added to it.
            if block[-1] not in newline_char_set:
                reversed_lines[0] = reversed_lines[0] + tail

            # Otherwise, the block ended on a line boundary, and the tail is a 
            # complete line itself.
            elif len(tail)>0:
                reversed_lines.insert(0,tail)

            # Within the current block, we can't tell if the first line is complete
            # or not, so we extract it and save it for the next go-round with a new
            # block. We yield instead of returning so all the internal state of this
            # iteration is preserved (how many lines returned, current tail, etc.).
            tail = reversed_lines.pop()

            for reversed_line in reversed_lines:
                yield reversed_line

    # We're out of blocks now; if there's a tail left over from the last block we read,
    # it's the very first line in the file. Yield that and we're done.
    if len(tail)>0:
        yield tail

def reversed_blocks(self, file, blocksize=4096):
    "Generate blocks of file's contents in reverse order."

    # Jump to the end of the file, and save the file offset.
    file.seek(0, os.SEEK_END)
    here = file.tell()

    # When the file offset reaches zero, we've read the whole file.
    while 0 < here:
        # Compute how far back we can step; either there's at least one
        # full block left, or we've gotten close enough to the start that
        # we'll read the whole file.
        delta = min(blocksize, here)

        # Back up to there and read the block; we yield it so that the 
        # variable containing the file offset is retained.
        file.seek(here - delta, os.SEEK_SET)
        yield file.read(delta)

        # Move the pointer back by the amount we just handed out. If we've
        # read the last block, "here" will now be zero.
        here -= delta

reversed_lines is an iterator, so you run it in a loop:

for line in self.reversed_lines(fh):
    do_something_with_the_line(line)

The comments are probably superfluous, but they were useful to me while I was working out how the iterators did their job.

Upvotes: 2

filmor
filmor

Reputation: 32258

with open(filename) as f:
    size = os.stat(filename).st_size
    f.seek(size - 4096)
    block = f.read(4096)
    # Find amount to truncate
    f.truncate(...)

Upvotes: 0

Related Questions