andrew

Reputation: 2574

Efficiency of line by line file reading in Python

Right now I am writing some Python code to deal with massive Twitter files that are too big to fit into memory. To work with them, I basically have two choices.

  1. I could split the files into smaller files that can fit into memory.

  2. I could process the big file line by line so I never need to fit the entire file into memory at once. I would prefer the latter for ease of implementation.

However, I am wondering whether it is faster to read the entire file into memory and then manipulate it from there. It seems like constantly reading from disk line by line could be slow, but I do not fully understand how these operations work in Python. Will reading the file line by line make my code slower than reading the whole file into memory and working on it there?

Upvotes: 8

Views: 8034

Answers (2)

Greg Hewgill

Reputation: 994897

For really fast file reading, have a look at the mmap module. This will make the entire file appear as a big chunk of virtual memory, even if it's much larger than your available RAM. If your file is bigger than 3 or 4 gigabytes, then you'll want to be using a 64-bit OS (and 64-bit build of Python).

I've done this for files over 30 GB in size with good results.
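A minimal sketch of this approach, assuming read-only access and counting lines as a stand-in for real work (the file name 'tweets.txt' is just a placeholder):

import mmap

# Hypothetical file name; substitute your actual Twitter dump.
with open('tweets.txt', 'rb') as f:
    # Map the whole file into virtual memory (length=0 means "entire file").
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        count = 0
        # mmap objects support readline(), so you can still walk the data
        # line by line while the OS pages it in and out as needed.
        while True:
            line = mm.readline()  # returns bytes
            if not line:
                break
            count += 1
        print(count)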

Upvotes: 10

spinlok

Reputation: 3661

If you want to process the file line by line, you could simply use the file object as an iterator:

with open('file', 'r') as f:
    for line in f:
        print(line, end='')

This is pretty memory efficient. If you want to work on a batch of lines at a time, you can also use the file object's readlines() method with a sizehint argument; it reads roughly sizehint bytes, rounded up so the last line is complete.
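For example, a rough sketch of processing the file in batches of about a megabyte at a time (the file name and the process_batch helper are placeholders for your own code):

def process_batch(lines):
    # Stand-in for whatever per-tweet work you need to do.
    for line in lines:
        pass

with open('tweets.txt', 'r') as f:
    while True:
        # Read roughly 1 MB worth of complete lines per iteration.
        batch = f.readlines(1024 * 1024)
        if not batch:
            break
        process_batch(batch)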

Upvotes: 1
