Reputation: 2574
Right now I am writing some Python code to deal with massive Twitter files. These files are so big that they can't fit into memory. To work with them, I basically have two choices:
I could split the files into smaller files that can fit into memory.
I could process the big file line by line so I never need to fit the entire file into memory at once. I would prefer the latter for ease of implementation.
However, I am wondering whether it is faster to read the entire file into memory and then manipulate it from there. It seems like constantly reading from disk, line by line, could be slow, but I do not fully understand how these processes work in Python. Will reading the file line by line make my code slower than reading the entire file into memory and manipulating it there?
Upvotes: 8
Views: 8034
Reputation: 994897
For really fast file reading, have a look at the mmap module. It makes the entire file appear as one big chunk of virtual memory, even if it's much larger than your available RAM. If your file is bigger than 3 or 4 gigabytes, you'll want a 64-bit OS (and a 64-bit build of Python).
I've done this for files over 30 GB in size with good results.
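For illustration, a minimal sketch of that approach might look like the following (the filename 'tweets.txt' and the line counting are placeholders, not part of the original answer):

import mmap

line_count = 0
with open('tweets.txt', 'rb') as f:  # placeholder filename
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # The OS pages data in and out as needed, so memory use stays modest
        # even when the file is far larger than physical RAM.
        for line in iter(mm.readline, b''):  # readline returns b'' at the end of the map
            line_count += 1                  # replace with your own per-line work
print(line_count)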
Upvotes: 10
Reputation: 3661
If you want to process the file line by line, you could simply use the file object as an iterator:
for line in open('file', 'r'):
    print(line)
This is pretty memory efficient. If you want to work on a batch of lines at a time, you can also use the readlines() method of the file object with a sizehint parameter, which reads in approximately sizehint bytes plus enough additional bytes to complete the last line.
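As a rough sketch of batched reading (the filename and batch size here are placeholders, not from the original answer):

with open('tweets.txt', 'r') as f:   # placeholder filename
    while True:
        batch = f.readlines(65536)   # roughly 64 KB of data, rounded up to whole lines
        if not batch:
            break
        for line in batch:
            pass                     # replace with your own per-line work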
Upvotes: 1