Reputation: 18353
Basics are that I need to process 4 GB text files on a per-line basis.
Using .readline() or for line in f is great for memory, but it takes ages on I/O. I'd like to use something like yield, but that (I think) will chop lines.
POSSIBLE ANSWER:
file.readlines([sizehint])
Read until EOF using readline() and return a list containing the lines
thus read. If the optional sizehint argument is present, instead of reading up to EOF, whole lines totalling approximately sizehint bytes (possibly after rounding up to an internal buffer size) are read. Objects implementing a file-like interface may choose to ignore sizehint if it cannot be implemented, or cannot be implemented efficiently.
Didn't realize you could do this!
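A minimal sketch of how that might look in a loop (the 64 KB hint and the process_line name are just placeholders, not from the docs):

def process_line(line):
    pass  # stand-in for the real per-line work

with open("huge.txt") as f:
    while True:
        # read roughly 64 KB worth of whole lines at a time
        lines = f.readlines(64 * 1024)
        if not lines:
            break
        for line in lines:
            process_line(line)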
Upvotes: 1
Views: 942
Reputation: 601609
You can just iterate over the file object:
with open("filename") as f:
    for line in f:
        whatever
This will do some internal buffering to improve performance. (Note that file.readline() will perform considerably worse because it does not use that buffer; that's also why you can't mix iteration over a file object with calls to file.readline().)
Upvotes: 7
Reputation: 34698
You could always chunk the lines up? I mean, why open one file and iterate all the way through when you can open the same file six times and iterate through a different slice of it in each? E.g.
a  # is the first 1024 bytes
b  # is the next 1024
# ...and so on...
f  # is the last 1024 bytes
With each file handle running in a separate process, we start to cook on gas. Just remember to deal with line endings properly.
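A rough sketch of that idea, assuming six workers, a hypothetical handle_line() function, and multiprocessing for the separate processes; each worker skips the partial line at the start of its byte range so every line is handled exactly once:

import os
from multiprocessing import Pool

FILENAME = "huge.txt"
NUM_WORKERS = 6

def handle_line(line):
    pass  # stand-in for the real per-line work

def process_range(byte_range):
    start, end = byte_range
    with open(FILENAME, "rb") as f:
        if start > 0:
            # back up one byte and discard the line that begins before `start`;
            # the previous worker is responsible for it
            f.seek(start - 1)
            f.readline()
        # process every line that starts inside [start, end)
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            handle_line(line)

if __name__ == "__main__":
    size = os.path.getsize(FILENAME)
    step = size // NUM_WORKERS
    ranges = [(i * step, size if i == NUM_WORKERS - 1 else (i + 1) * step)
              for i in range(NUM_WORKERS)]
    with Pool(NUM_WORKERS) as pool:
        pool.map(process_range, ranges)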
Upvotes: 0
Reputation: 117681
If you want to do something on a per-line basis you can just loop over the file object:
f = open("w00t.txt")
for line in f:
    pass  # do stuff with line
However, doing stuff on a per-line basis can be an actual performance bottleneck, so perhaps you should read in bigger chunks? What you can do is, for example, read 4096 bytes, find the last line ending \n, process everything up to that point, and prepend the part that is left over to the next chunk.
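A minimal sketch of that chunk-and-carry approach, assuming a 4096-byte chunk size and a hypothetical process_line() function:

def process_line(line):
    pass  # stand-in for the real per-line work

with open("w00t.txt", "rb") as f:
    leftover = b""
    while True:
        chunk = f.read(4096)
        if not chunk:
            break
        chunk = leftover + chunk
        # split at the last newline; carry the trailing partial line forward
        last_newline = chunk.rfind(b"\n")
        if last_newline == -1:
            leftover = chunk
            continue
        for line in chunk[:last_newline].split(b"\n"):
            process_line(line)
        leftover = chunk[last_newline + 1:]
    if leftover:
        process_line(leftover)  # final line without a trailing newline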
Upvotes: 1