Reputation: 867
I have to parse a huge (250 MB) text file, which for some reason is only a single line, causing every text editor I tried (Notepad++, Visual Studio, Matlab) to fail to load it. Therefore I read it piece by piece, and parse it whenever a logical line (starting with #) has been completely read:
f = open(filename, "rt")
line = ""
buffer = "blub"
while buffer != "":
    buffer = f.read(10000)
    i = buffer.find('#')
    if i != -1:  # end of line found
        line += buffer[:i]
        ProcessLine(line)
        line = buffer[i+1:]  # skip the '#'
    else:  # still reading current line
        line += buffer
This works reasonably well; however, it might happen that a line is shorter than my buffer, which would cause me to skip a line. So I replaced the loop with:
while buffer != "":
    buffer = f.read(10000)
    i = buffer.find('#')
    while i != -1:
        pixels += 1
        line += buffer[:i]
        buffer = buffer[i+1:]
        ProcessLine(line)
        i = buffer.find('#')
    line += buffer
This does the trick. However, it is at least a hundred times slower, rendering it useless for reading files that large. I don't really see how this can happen. I do have an inner loop, but most of the time it is only repeated once. I also copy the buffer (buffer = buffer[i+1:]), which I could somewhat understand if performance dropped by half, but I don't see how this could make it about 100 times slower.
As a side note: my (logical) lines are about 27,000 bytes. Therefore, if my buffer is 10,000 bytes, I never skip lines in the first implementation; if it is 30,000 bytes, I do. This does not seem to impact performance, however: even when the inner loop in the second implementation is evaluated at most once, performance is still horrible.
What is going on under the hood that I am missing?
Upvotes: 3
Views: 92
Reputation: 2046
If I understood correctly what you want to do, then both versions of your code are wrong. As @Leon said, in the second version you are missing line = "" after ProcessLine(line). In the first version, only the first line is handled correctly: as you said, when a line is shorter than the buffer, you take just the first part of the buffer with line += buffer[:i], but the problem is the statement line = buffer[i+1:]. If your line is about 1,000 characters long and the buffer is 10,000 characters long, then line becomes roughly 9,000 characters long, probably containing more than one logical line. From reading:
"This works reasonably well; however, it might happen that a line is shorter than my buffer, which would cause me to skip a line"
I think you realised that, but the reason I am spelling it out in detail is that it is also the reason why your first version runs faster.
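To make that concrete, here is a toy illustration of the merging (the buffer contents and the print-only ProcessLine are made up for demonstration):
def ProcessLine(line):   # stand-in for the real handler, just prints
    print("processed:", line)

buffer = "aaa#bbb#ccc"   # one read() that happens to contain two '#'
line = ""
i = buffer.find('#')     # finds only the FIRST '#', at index 3
if i != -1:
    line += buffer[:i]
    ProcessLine(line)    # "aaa" is processed correctly
    line = buffer[i+1:]  # line is now "bbb#ccc": two logical lines merged
On the next iteration, the # embedded in line is never looked at again, because find only searches the newly read buffer.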
Having explained that, I think the best approach would be to read the whole file and then split the text to get the lines, so your code would look like this:
f = open('textfile.txt', "rt")
buffer = f.read()
f.close()
l = buffer.split('#')
and then you can use something like:
for line in l:
    ProcessLine(line)
Getting the list l took me less than 2 seconds.
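Put together, a minimal end-to-end sketch could look like this (ProcessLine is assumed to be your existing handler; note that split('#') also yields whatever precedes the first # and follows the last one, which may be empty strings):
with open('textfile.txt', "rt") as f:  # context manager closes the file for us
    buffer = f.read()                  # one big read; 250 MB fits in memory

for line in buffer.split('#'):         # split once instead of repeated find()
    if line:                           # skip empty fragments, e.g. a leading '#'
        ProcessLine(line)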
PS: You shouldn't have problems opening large files (like 250 MB) with Notepad; I have even opened 500 MB files.
Upvotes: 2
Reputation: 32514
Your second version not only runs slower but also works incorrectly.
In your first version you reset line with an assignment (line = buffer[i+1:]), whereas in the second version you only ever append to line. As a result, in the second version, line ends up containing the entire contents of your file, less the # symbols.
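That also explains the slowdown: every ProcessLine call receives all the text seen so far, so the total work grows roughly quadratically with the file size. A toy run of the buggy loop (made-up data, stand-in ProcessLine) makes this visible:
def ProcessLine(line):       # stand-in that only reports the length
    print("processing", len(line), "chars")

line = ""
buffer = "aa#bbb#cccc#"      # pretend this came from f.read(10000)
i = buffer.find('#')
while i != -1:
    line += buffer[:i]
    buffer = buffer[i+1:]
    ProcessLine(line)        # line still holds all previous lines
    i = buffer.find('#')
# prints 2, 5, 9: each call re-processes everything seen so far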
Fix the code by clearing line immediately after processing it:
while buffer != "":
    buffer = f.read(10000)
    i = buffer.find('#')
    while i != -1:
        pixels += 1
        line += buffer[:i]
        buffer = buffer[i+1:]
        ProcessLine(line)
        line = ""  # sic!
        i = buffer.find('#')
    line += buffer
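One further design note: buffer = buffer[i+1:] still copies the tail of the buffer on every hit. A sketch of one way around that (under the same assumptions as the question: f, line, pixels, and a non-empty initial buffer already exist) is to track a start offset and pass it to str.find instead of slicing:
while buffer != "":
    buffer = f.read(10000)
    start = 0
    i = buffer.find('#', start)
    while i != -1:
        pixels += 1
        line += buffer[start:i]    # copy only the finished line's tail
        ProcessLine(line)
        line = ""
        start = i + 1              # step past the '#' without re-slicing
        i = buffer.find('#', start)
    line += buffer[start:]         # keep the unfinished remainder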
Upvotes: 1