Reputation: 309
I am trying to compare two large text files (10 GB each) line by line without loading the entire files into memory. I used the following code, as suggested in other threads:
with open(in_file1, "r") as f1, open(in_file2, "r") as f2:
    for line1, line2 in zip(f1, f2):
        compare(line1, line2)
But it seems that Python does not read the files line by line: I observed that memory usage climbed above 20 GB while the code was running. I also tried using fileinput:
import fileinput

for line1, line2 in zip(fileinput.input([in_file1]), fileinput.input([in_file2])):
    compare(line1, line2)
This also loads everything into memory. I'm using Python 2.7.4 on CentOS 5.9, and I don't store any of the lines anywhere in my code.
What was going wrong in my code? How should I change it to avoid loading everything into RAM?
Upvotes: 0
Views: 439
Reputation: 2296
In Python 2, the built-in zip function returns a list of tuples, so it reads both files completely just to build that list. Use itertools.izip instead; it returns an iterator that produces the tuples lazily, one pair of lines at a time.
from itertools import izip

with open(in_file1, "r") as f1, open(in_file2, "r") as f2:
    for line1, line2 in izip(f1, f2):
        compare(line1, line2)
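For completeness, here is a minimal sketch of how the loop might look with an inline comparison that reports the first mismatching line; the comparison logic is an assumption for illustration, since the question does not show what compare does:

from itertools import izip

with open(in_file1, "r") as f1, open(in_file2, "r") as f2:
    # izip yields one pair of lines at a time, so only the current two lines are held in memory
    for line_no, (line1, line2) in enumerate(izip(f1, f2), 1):
        if line1 != line2:  # hypothetical comparison: stop at the first difference
            print("Files differ at line %d" % line_no)
            break

Note that izip stops as soon as the shorter file is exhausted; if the files can differ in length and the trailing lines matter, itertools.izip_longest pads the shorter side with None.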
Upvotes: 6