Reputation: 2574
Right now I am writing some Python code to deal with massive Twitter files. These files are so big that they can't fit into memory. To work with them, I basically have two choices:
I could split the files into smaller files that can fit into memory.
I could process the big file line by line so I never need to fit the entire file into memory at once. I would prefer the latter for ease of implementation.
However, I am wondering whether it is faster to read the entire file into memory and then manipulate it from there. It seems like constantly reading from disk, line by line, could be slow, but I do not fully understand how these processes work in Python. Will reading the file line by line make my code slower than reading the entire file into memory and manipulating it there?
Upvotes: 8
Views: 8034
Reputation: 994897
For really fast file reading, have a look at the mmap module. It makes the entire file appear as one big chunk of virtual memory, even if it's much larger than your available RAM. If your file is bigger than 3 or 4 gigabytes, you'll want a 64-bit OS (and a 64-bit build of Python).
I've done this for files over 30 GB in size with good results.
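For illustration, a minimal sketch of that approach might look like the following (the filename 'tweets.txt' and the line counting are placeholders, not part of the original answer):

import mmap

line_count = 0
with open('tweets.txt', 'rb') as f:  # placeholder filename
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # The OS pages data in and out as needed, so memory use stays modest
        # even when the file is far larger than physical RAM.
        for line in iter(mm.readline, b''):  # readline returns b'' at the end of the map
            line_count += 1                  # replace with your own per-line work
print(line_count)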
Upvotes: 10
Reputation: 3661
If you want to process the file line by line, you could simply use the file object as an iterator:
for line in open('file', 'r'):
    print(line)
This is pretty memory efficient. If you want to work on a batch of lines at a time, you can also use the readlines() method of the file object with a sizehint parameter, which reads in approximately sizehint bytes plus enough additional bytes to complete the last line.
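As a rough sketch of batched reading (the filename and batch size here are placeholders, not from the original answer):

with open('tweets.txt', 'r') as f:   # placeholder filename
    while True:
        batch = f.readlines(65536)   # roughly 64 KB of data, rounded up to whole lines
        if not batch:
            break
        for line in batch:
            pass                     # replace with your own per-line work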
Upvotes: 1