Reputation: 368
What is the most efficient (fastest) way to simultaneously read in two large files and do some processing?
I have two files, a.txt and b.txt, each containing about a hundred thousand corresponding lines. My goal is to read in the two files and then do some processing on each line pair:
def kernel():
    a_file = open('a.txt', 'r')
    b_file = open('b.txt', 'r')
    a_line = a_file.readline()
    b_line = b_file.readline()
    while a_line:
        a_spl = a_line.split()
        b_spl = b_line.split()
        process(a_spl, b_spl)  # process requiring both corresponding file lines
        a_line = a_file.readline()
        b_line = b_file.readline()
I looked into xreadlines and readlines, but I'm wondering if I can do better. Speed is of paramount importance for this task.
Thank you.
Upvotes: 1
Views: 2685
Reputation: 19377
The code below does not accumulate data from the input files in memory, unless the process function does that by itself.
from itertools import izip

def process(line1, line2):
    # process a line from each input
    pass

file1, file2 = 'a.txt', 'b.txt'   # the two input files from the question

with open(file1, 'r') as f1:
    with open(file2, 'r') as f2:
        for a, b in izip(f1, f2):
            process(a, b)
If the process function is efficient, this code should run quickly enough for most purposes. The for loop will terminate when the end of one of the files is reached. If either file contains an extraordinarily long line (e.g. XML, JSON), or if the files are not text, this code may not work well.
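For reference, itertools.izip exists only on Python 2; on Python 3 the built-in zip is already lazy, so an equivalent sketch (assuming the a.txt and b.txt filenames from the question and a placeholder process function) would be:

def process(line1, line2):
    # placeholder for the real per-pair work
    pass

with open('a.txt', 'r') as f1, open('b.txt', 'r') as f2:
    for a, b in zip(f1, f2):   # zip is lazy in Python 3, like izip
        process(a, b)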
Upvotes: 2
Reputation: 2976
String I/O can be pretty fast -- most likely, your processing will be what slows things down. Consider a simple input loop that feeds a queue, like:
import itertools
import multiprocessing

queue = multiprocessing.Queue(100)
a_file = open('a.txt')
b_file = open('b.txt')
for pair in itertools.izip(a_file, b_file):
    queue.put(pair)  # blocks here on a full queue
You can set up a pool of processes pulling items from the queue and taking action on each, assuming your problem can be parallelised this way.
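A minimal sketch of that consumer side, assuming Python 3 (where the built-in zip is lazy), a hypothetical handle_pair function for the real work, and a None sentinel per worker to signal the end of input:

import multiprocessing

def handle_pair(pair):
    # hypothetical per-pair processing
    a_line, b_line = pair

def worker(queue):
    # pull pairs off the queue until the sentinel arrives
    while True:
        pair = queue.get()
        if pair is None:
            break
        handle_pair(pair)

if __name__ == '__main__':
    queue = multiprocessing.Queue(100)
    workers = [multiprocessing.Process(target=worker, args=(queue,)) for _ in range(4)]
    for w in workers:
        w.start()
    with open('a.txt') as a_file, open('b.txt') as b_file:
        for pair in zip(a_file, b_file):
            queue.put(pair)    # blocks here on a full queue
    for _ in workers:
        queue.put(None)        # one sentinel per worker
    for w in workers:
        w.join()

Note that for CPU-bound processing this only helps because each pair is handled in a separate process; the feeding loop itself stays single-threaded.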
Upvotes: 1
Reputation: 137450
You can use the with statement to make sure your files are closed after execution. From this blog entry:
to open a file, process its contents, and make sure to close it, you can simply do:
with open("x.txt") as f:
data = f.read()
do something with data
Upvotes: 1
Reputation: 7336
I'd change your while condition to the following so that it doesn't fail when a.txt has more lines than b.txt:
while a_line and b_line:
Otherwise, that looks good. You are reading in the two lines that you need, then processing them. You could even multithread this by reading in N pairs of lines and sending each pair off to a new thread or similar.
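A rough sketch of that batching idea with the standard library's concurrent.futures, assuming Python 3, a hypothetical process(a_line, b_line) function, and N line pairs per batch (for CPU-bound work, CPython's GIL means threads may not give a real speed-up; a ProcessPoolExecutor would be the drop-in alternative):

from concurrent.futures import ThreadPoolExecutor
from itertools import islice

N = 1000   # number of line pairs per batch

def process(a_line, b_line):
    # hypothetical per-pair processing
    pass

def process_batch(batch):
    for a_line, b_line in batch:
        process(a_line, b_line)

with open('a.txt') as a_file, open('b.txt') as b_file, ThreadPoolExecutor() as pool:
    pairs = zip(a_file, b_file)
    while True:
        batch = list(islice(pairs, N))
        if not batch:
            break
        pool.submit(process_batch, batch)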
Upvotes: 0