Reputation: 867
I have to parse a huge (250 MB) text file, which for some reason is only a single line, causing every text editor I tried (Notepad++, Visual Studio, Matlab) to fail to load it. Therefore I read it piece by piece, and parse it whenever a logical line (starting with #) has been completely read:
f = open(filename, "rt")
line = ""
buffer = "blub"
while buffer != "":
    buffer = f.read(10000)
    i = buffer.find('#')
    if i != -1:  # end of line found
        line += buffer[:i]
        ProcessLine(line)
        line = buffer[i+1:]  # skip the '#'
    else:  # still reading current line
        line += buffer
This works reasonably well; however, it might happen that a line is shorter than my buffer, which would cause me to skip a line. So I replaced the loop with:
while buffer != "":
    buffer = f.read(10000)
    i = buffer.find('#')
    while i != -1:
        pixels += 1
        line += buffer[:i]
        buffer = buffer[i+1:]
        ProcessLine(line)
        i = buffer.find('#')
    line += buffer
This does the trick. However, it is at least a hundred times slower, rendering it useless for reading files that large. I don't really see how this can happen. I do have an inner loop, but most of the time it is only repeated once. I also copy the buffer (buffer = buffer[i+1:]), which I could somewhat understand if performance dropped by half, but I don't see how this could make it about 100 times slower.
As a side note: my (logical) lines are about 27,000 bytes. Therefore, if my buffer is 10,000 bytes, I never skip lines in the first implementation; if it is 30,000 bytes, I do. This does not seem to impact performance, however: even when the inner loop in the second implementation is evaluated at most once, performance is still horrible.
What is going on under the hood that I am missing?
Upvotes: 3
Views: 92
Reputation: 2046
If I understood correctly what you want to do, then both versions of your code are wrong. As @Leon said, in the second version you are missing line = "" after ProcessLine(line). In the first version, only the first line is handled correctly: as you said, when a line is shorter than the buffer, you take just the first part of the buffer with line += buffer[:i], but the problem is the statement line = buffer[i+1:]. If your line is about 1,000 characters long and the buffer is 10,000 characters long, then line becomes roughly 9,000 characters long, probably containing more than one logical line. From reading:
"This works reasonably well; however, it might happen that a line is shorter than my buffer, which would cause me to skip a line"
I think you realised that, but the reason I am spelling it out in detail is that it is also the reason why your first version runs faster.
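To make that concrete, here is a toy illustration of the merging (the buffer contents and the print-only ProcessLine are made up for demonstration):
def ProcessLine(line):   # stand-in for the real handler, just prints
    print("processed:", line)

buffer = "aaa#bbb#ccc"   # one read() that happens to contain two '#'
line = ""
i = buffer.find('#')     # finds only the FIRST '#', at index 3
if i != -1:
    line += buffer[:i]
    ProcessLine(line)    # "aaa" is processed correctly
    line = buffer[i+1:]  # line is now "bbb#ccc": two logical lines merged
On the next iteration, the # embedded in line is never looked at again, because find only searches the newly read buffer.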
Having explained that, I think the best approach would be to read the whole file and then split the text to get the lines, so your code would look like this:
f = open('textfile.txt', "rt")
buffer = f.read()
f.close()
l = buffer.split('#')
and then you can use something like:
for line in l:
    ProcessLine(line)
Getting the list l took me less than 2 seconds.
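Put together, a minimal end-to-end sketch could look like this (ProcessLine is assumed to be your existing handler; note that split('#') also yields whatever precedes the first # and follows the last one, which may be empty strings):
with open('textfile.txt', "rt") as f:  # context manager closes the file for us
    buffer = f.read()                  # one big read; 250 MB fits in memory

for line in buffer.split('#'):         # split once instead of repeated find()
    if line:                           # skip empty fragments, e.g. a leading '#'
        ProcessLine(line)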
PS: You shouldn't have problems opening large files (like 250 MB) with Notepad; I have even opened 500 MB files.
Upvotes: 2
Reputation: 32514
Your second version not only runs slower but also works incorrectly.
In your first version you reset line with an assignment (line = buffer[i+1:]), whereas in the second version you only ever append to line. As a result, in the second version, line ends up containing the entire contents of your file, less the # symbols.
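That also explains the slowdown: every ProcessLine call receives all the text seen so far, so the total work grows roughly quadratically with the file size. A toy run of the buggy loop (made-up data, stand-in ProcessLine) makes this visible:
def ProcessLine(line):       # stand-in that only reports the length
    print("processing", len(line), "chars")

line = ""
buffer = "aa#bbb#cccc#"      # pretend this came from f.read(10000)
i = buffer.find('#')
while i != -1:
    line += buffer[:i]
    buffer = buffer[i+1:]
    ProcessLine(line)        # line still holds all previous lines
    i = buffer.find('#')
# prints 2, 5, 9: each call re-processes everything seen so far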
Fix the code by clearing line immediately after processing it:
while buffer != "":
    buffer = f.read(10000)
    i = buffer.find('#')
    while i != -1:
        pixels += 1
        line += buffer[:i]
        buffer = buffer[i+1:]
        ProcessLine(line)
        line = ""  # sic!
        i = buffer.find('#')
    line += buffer
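One further design note: buffer = buffer[i+1:] still copies the tail of the buffer on every hit. A sketch of one way around that (under the same assumptions as the question: f, line, pixels, and a non-empty initial buffer already exist) is to track a start offset and pass it to str.find instead of slicing:
while buffer != "":
    buffer = f.read(10000)
    start = 0
    i = buffer.find('#', start)
    while i != -1:
        pixels += 1
        line += buffer[start:i]    # copy only the finished line's tail
        ProcessLine(line)
        line = ""
        start = i + 1              # step past the '#' without re-slicing
        i = buffer.find('#', start)
    line += buffer[start:]         # keep the unfinished remainder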
Upvotes: 1