Reputation: 79615
I want to remove the trailing blank lines from a file (if it has any). Currently I do it by reading it in memory, removing the blank lines there, and overwriting it. The file is big, however (30000+ lines, and long lines), and this takes 2-3 seconds.
So I want to read the file line-by-line, but backwards, until I get to the first non-blank line. That is, I start with the last line, then the next to last, etc. I will just truncate it then, instead of overwriting it.
What would be the best way read it in reverse? Right now I'm thinking of reading blocks of 64k, and then looping through the string in reverse, char-by-char, until I get a line, and then when I run out of the 64k, to read another 64k and prepend them, and so on.
I assume there are no standard functions or libraries that read in reverse order?
Upvotes: 0
Views: 683
Reputation: 3382
This is a modified version of some code I found elsewhere (probably here on StackOverflow, actually...) - I've extracted the two critical methods that handle reading backwards.
The reversed_blocks
iterator reads backwards through the file in chunks of the size you prefer, and the reversed_lines
iterator breaks the blocks into lines, saving the first one; if the next block ends in a newline, it returns this as a complete line, and if it doesn't, it appends the saved partial line to the last line of the new block, completing the line that was split over the block boundary.
All of the state is maintained by Python's iterator mechanisms, so we don't have to store the state globally anywhere; this also means you can read multiple files backward at once if you needed to, as the state is tied to the iterator.
def reversed_lines(self, file):
"Generate the lines of file in reverse order."
newline_char_set = set(['\r', '\n'])
tail = ""
for block in self.reversed_blocks(file):
if block is not None and len(block)>0:
# First split the whole block into lines and reverse the list
reversed_lines = block.splitlines()
reversed_lines.reverse()
# If the last char of the block is not a newline, then the last line
# crosses a block boundary, and the tail (possible partial line from
# the previous block) should be added to it.
if block[-1] not in newline_char_set:
reversed_lines[0] = reversed_lines[0] + tail
# Otherwise, the block ended on a line boundary, and the tail is a
# complete line itself.
elif len(tail)>0:
reversed_lines.insert(0,tail)
# Within the current block, we can't tell if the first line is complete
# or not, so we extract it and save it for the next go-round with a new
# block. We yield instead of returning so all the internal state of this
# iteration is preserved (how many lines returned, current tail, etc.).
tail = reversed_lines.pop()
for reversed_line in reversed_lines:
yield reversed_line
# We're out of blocks now; if there's a tail left over from the last block we read,
# it's the very first line in the file. Yield that and we're done.
if len(tail)>0:
yield tail
def reversed_blocks(self, file, blocksize=4096):
"Generate blocks of file's contents in reverse order."
# Jump to the end of the file, and save the file offset.
file.seek(0, os.SEEK_END)
here = file.tell()
# When the file offset reaches zero, we've read the whole file.
while 0 < here:
# Compute how far back we can step; either there's at least one
# full block left, or we've gotten close enough to the start that
# we'll read the whole file.
delta = min(blocksize, here)
# Back up to there and read the block; we yield it so that the
# variable containing the file offset is retained.
file.seek(here - delta, os.SEEK_SET)
yield file.read(delta)
# Move the pointer back by the amount we just handed out. If we've
# read the last block, "here" will now be zero.
here -= delta
reversed_lines
is an iterator, so you run it in a loop:
for line in self.reversed_lines(fh):
do_something_with_the_line(line)
The comments are probably superfluous, but they were useful to me while I was working out how the iterators did their job.
Upvotes: 2
Reputation: 32258
with open(filename) as f:
size = os.stat(filename).st_size
f.seek(size - 4096)
block = f.read(4096)
# Find amount to truncate
f.truncate(...)
Upvotes: 0