Reputation: 309
I am trying to compare two large text files (10 GB each) line by line without loading the entire files into memory. I used the following code, as suggested in other threads:
with open(in_file1, "r") as f1, open(in_file2, "r") as f2:
    for line1, line2 in zip(f1, f2):
        compare(line1, line2)
But it seems that Python does not read the files line by line: I observed that memory usage climbed above 20 GB while the code was running. I also tried using fileinput:
import fileinput

for line1, line2 in zip(fileinput.input([in_file1]), fileinput.input([in_file2])):
    compare(line1, line2)
This also loads everything into memory. I'm using Python 2.7.4 on CentOS 5.9, and I don't store any of the lines anywhere in my code.
What was going wrong in my code? How should I change it to avoid loading everything into RAM?
Upvotes: 0
Views: 439
Reputation: 2296
In Python 2, the built-in zip function returns a list of tuples, so it reads both files completely just to build that list. Use itertools.izip instead; it returns an iterator that produces the tuples lazily, one pair of lines at a time.
from itertools import izip

with open(in_file1, "r") as f1, open(in_file2, "r") as f2:
    for line1, line2 in izip(f1, f2):
        compare(line1, line2)
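For completeness, here is a minimal sketch of how the loop might look with an inline comparison that reports the first mismatching line; the comparison logic is an assumption for illustration, since the question does not show what compare does:

from itertools import izip

with open(in_file1, "r") as f1, open(in_file2, "r") as f2:
    # izip yields one pair of lines at a time, so only the current two lines are held in memory
    for line_no, (line1, line2) in enumerate(izip(f1, f2), 1):
        if line1 != line2:  # hypothetical comparison: stop at the first difference
            print("Files differ at line %d" % line_no)
            break

Note that izip stops as soon as the shorter file is exhausted; if the files can differ in length and the trailing lines matter, itertools.izip_longest pads the shorter side with None.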
Upvotes: 6