joedborg

Reputation: 18353

Handling big text files in Python

Basically, I need to process 4 GB text files line by line.

Using .readline() or for line in f is great for memory, but the I/O takes ages. I would like to use something like yield, but (I think) that will chop lines.

POSSIBLE ANSWER:

file.readlines([sizehint])
Read until EOF using readline() and return a list containing the lines thus read. If the optional sizehint argument is present, instead of reading up to EOF, whole lines totalling approximately sizehint bytes (possibly after rounding up to an internal buffer size) are read. Objects implementing a file-like interface may choose to ignore sizehint if it cannot be implemented, or cannot be implemented efficiently.

Didn't realize you could do this!
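A minimal sketch of how the sizehint idea could be used to read whole lines in batches. The helper name iter_line_batches and the 1 MiB default are assumptions for illustration, not part of the quoted documentation, and the exact batch sizes depend on the file object:

```python
def iter_line_batches(f, size_hint=1 << 20):
    """Yield lists of whole lines, each list totalling roughly size_hint.

    Hypothetical helper: it relies only on f.readlines(hint), which
    returns complete lines and an empty list at EOF.
    """
    while True:
        batch = f.readlines(size_hint)
        if not batch:  # empty list means EOF
            break
        yield batch
```

Usage would look like: for batch in iter_line_batches(open("filename")), then loop over the lines in each batch. Lines are never split, because readlines() only returns complete lines.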

Upvotes: 1

Views: 942

Answers (3)

Sven Marnach

Reputation: 601609

You can just iterate over the file object:

with open("filename") as f:
    for line in f:
        process(line)  # process() is a placeholder for your per-line work

This will do some internal buffering to improve the performance. (Note that file.readline() will perform considerably worse because it does not buffer -- that's why you can't mix iteration over a file object with file.readline().)

Upvotes: 7

Jakob Bowyer

Reputation: 34698

You could always split the file into chunks. Why open the file once and iterate all the way through when you can open the same file 6 times and have each handle iterate over its own part? E.g.:

a #is the first 1024 bytes
b #is the next 1024
#etcetc
f #is the last 1024 bytes

With each file handle running in a separate process, we start to cook on gas. Just remember to deal with line endings properly, since a chunk boundary will usually fall in the middle of a line.
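One way to handle the line endings is sketched below, assuming "\n" line endings and a file opened in binary mode. The function names chunk_bounds and read_chunk_lines are hypothetical. Each chunk owns exactly the lines that start inside its byte range, so no line is lost or read twice:

```python
import os

def chunk_bounds(path, n_chunks):
    """Split the file's byte range into n_chunks (start, end) pairs."""
    size = os.path.getsize(path)
    return [(i * size // n_chunks, (i + 1) * size // n_chunks)
            for i in range(n_chunks)]

def read_chunk_lines(path, start, end):
    """Return the lines whose first byte lies in [start, end)."""
    lines = []
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and read to the next newline. This
            # skips the tail of a line that began in the previous chunk,
            # or just the newline itself if a line begins at start.
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:  # EOF
                break
            lines.append(line)
    return lines
```

Each (start, end) pair could then be handed to a separate worker process, for example through multiprocessing.Pool. Whether that actually helps depends on whether the job is bound by disk I/O or by the per-line processing.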

Upvotes: 0

orlp

Reputation: 117681

If you want to do something on a per-line basis you can just loop over the file object:

with open("w00t.txt") as f:
    for line in f:
        pass  # do stuff with line

However, doing stuff on a per-line basis can be an actual performance bottleneck, so perhaps you should use a better chunk size? What you can do is, for example, read 4096 bytes, find the last line ending \n, process everything up to and including it, and prepend the part that is left to the next chunk.
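The chunk-and-carry-over approach described above could be sketched roughly like this, assuming a file opened in binary mode with "\n" line endings. The generator name iter_whole_line_blocks is made up for illustration:

```python
def iter_whole_line_blocks(f, chunk_size=4096):
    """Yield byte blocks that always end on a line boundary.

    Hypothetical helper: reads chunk_size bytes at a time, cuts at the
    last newline, and carries the unfinished tail into the next chunk.
    """
    leftover = b""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:  # EOF
            break
        data = leftover + chunk
        cut = data.rfind(b"\n")
        if cut == -1:
            # No complete line yet; keep accumulating.
            leftover = data
            continue
        yield data[:cut + 1]
        leftover = data[cut + 1:]
    if leftover:
        # Final line without a trailing newline.
        yield leftover
```

Each yielded block contains only complete lines, so it can be processed in bulk or split on b"\n" as needed. A larger chunk_size than 4096 may well perform better; that is worth measuring on the actual files.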

Upvotes: 1
