Reputation: 368
What is the most efficient (fastest) way to simultaneously read in two large files and do some processing?
I have two files, a.txt and b.txt, each containing about a hundred thousand corresponding lines. My goal is to read in the two files and then do some processing on each line pair:
def kernel():
    a_file = open('a.txt', 'r')
    b_file = open('b.txt', 'r')
    a_line = a_file.readline()
    b_line = b_file.readline()
    while a_line:
        a_spl = a_line.split()
        b_spl = b_line.split()
        process(a_spl, b_spl)  # process requiring both corresponding file lines
        a_line = a_file.readline()
        b_line = b_file.readline()
I looked into xreadlines and readlines, but I'm wondering if I can do better. Speed is of paramount importance for this task.
Thank you.
Upvotes: 1
Views: 2685
Reputation: 19377
The code below does not accumulate data from the input files in memory, unless the process function does that by itself.
from itertools import izip

def process(line1, line2):
    # process a line from each input
    pass

file1, file2 = 'a.txt', 'b.txt'   # the two input files from the question

with open(file1, 'r') as f1:
    with open(file2, 'r') as f2:
        for a, b in izip(f1, f2):
            process(a, b)
If the process function is efficient, this code should run quickly enough for most purposes. The for loop will terminate when the end of one of the files is reached. If either file contains an extraordinarily long line (e.g. XML, JSON), or if the files are not text, this code may not work well.
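For reference, itertools.izip exists only on Python 2; on Python 3 the built-in zip is already lazy, so an equivalent sketch (assuming the a.txt and b.txt filenames from the question and a placeholder process function) would be:

def process(line1, line2):
    # placeholder for the real per-pair work
    pass

with open('a.txt', 'r') as f1, open('b.txt', 'r') as f2:
    for a, b in zip(f1, f2):   # zip is lazy in Python 3, like izip
        process(a, b)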
Upvotes: 2
Reputation: 2976
String I/O can be pretty fast -- most likely, your processing will be what slows things down. Consider a simple input loop that feeds a queue, like:
import itertools
import multiprocessing

queue = multiprocessing.Queue(100)
a_file = open('a.txt')
b_file = open('b.txt')
for pair in itertools.izip(a_file, b_file):
    queue.put(pair)  # blocks here on a full queue
You can set up a pool of processes pulling items from the queue and taking action on each, assuming your problem can be parallelised this way.
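A minimal sketch of that consumer side, assuming Python 3 (where the built-in zip is lazy), a hypothetical handle_pair function for the real work, and a None sentinel per worker to signal the end of input:

import multiprocessing

def handle_pair(pair):
    # hypothetical per-pair processing
    a_line, b_line = pair

def worker(queue):
    # pull pairs off the queue until the sentinel arrives
    while True:
        pair = queue.get()
        if pair is None:
            break
        handle_pair(pair)

if __name__ == '__main__':
    queue = multiprocessing.Queue(100)
    workers = [multiprocessing.Process(target=worker, args=(queue,)) for _ in range(4)]
    for w in workers:
        w.start()
    with open('a.txt') as a_file, open('b.txt') as b_file:
        for pair in zip(a_file, b_file):
            queue.put(pair)    # blocks here on a full queue
    for _ in workers:
        queue.put(None)        # one sentinel per worker
    for w in workers:
        w.join()

Note that for CPU-bound processing this only helps because each pair is handled in a separate process; the feeding loop itself stays single-threaded.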
Upvotes: 1
Reputation: 137450
You can use the with statement to make sure your files are closed after execution. From this blog entry:
to open a file, process its contents, and make sure to close it, you can simply do:
with open("x.txt") as f:
data = f.read()
do something with data
Upvotes: 1
Reputation: 7336
I'd change your while condition to the following so that it doesn't fail when a.txt has more lines than b.txt:
while a_line and b_line:
Otherwise, that looks good. You are reading in the two lines that you need, then processing them. You could even multithread this by reading in N pairs of lines and sending each pair off to a new thread or similar.
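A rough sketch of that batching idea with the standard library's concurrent.futures, assuming Python 3, a hypothetical process(a_line, b_line) function, and N line pairs per batch (for CPU-bound work, CPython's GIL means threads may not give a real speed-up; a ProcessPoolExecutor would be the drop-in alternative):

from concurrent.futures import ThreadPoolExecutor
from itertools import islice

N = 1000   # number of line pairs per batch

def process(a_line, b_line):
    # hypothetical per-pair processing
    pass

def process_batch(batch):
    for a_line, b_line in batch:
        process(a_line, b_line)

with open('a.txt') as a_file, open('b.txt') as b_file, ThreadPoolExecutor() as pool:
    pairs = zip(a_file, b_file)
    while True:
        batch = list(islice(pairs, N))
        if not batch:
            break
        pool.submit(process_batch, batch)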
Upvotes: 0