Reputation: 5455
I have a list of log files, where each line in each file has a timestamp and the lines are pre-sorted ascending within each file. The different files can have overlapping time ranges, and my goal is to blend them together into one large file, sorted by timestamp. There can be ties in the sorting, in which case I want to next line to come from whatever file is listed first in my input list.
I've seen examples of how to do this using fileinput
(see here), but this seems to read all the files into memory. Owing to the large size of my files this will be a problem. Because my files are pre-sorted, it seems there should be a way to merge them using a method that only has to consider the most recent unexplored line from each file.
Upvotes: 3
Views: 4741
Reputation: 176
Why roll your own if there is heapq.merge()
in the standard library? Unfortunately it doesn't provide a key argument -- you have to do the decorate - merge - undecorate dance yourself:
from itertools import imap
from operator import itemgetter
import heapq
def extract_timestamp(line):
"""Extract timestamp and convert to a form that gives the
expected result in a comparison
"""
return line.split()[1] # for example
with open("log1.txt") as f1, open("log2.txt") as f2:
sources = [f1, f2]
with open("merged.txt", "w") as dest:
decorated = [
((extract_timestamp(line), line) for line in f)
for f in sources]
merged = heapq.merge(*decorated)
undecorated = imap(itemgetter(-1), merged)
dest.writelines(undecorated)
Every step in the above is "lazy". As I avoid file.readlines()
the lines in the files are read as needed. Likewise the decoration process which uses generator expressions rather than list-comps. heapq.merge()
is lazy, too -- it needs one item per input iterator simultaneously to do the necessary comparisons. Finally I'm using itertools.imap()
, the lazy variant of the map() built-in to undecorate.
(In Python 3 map() has become lazy, so you can use that)
Upvotes: 16
Reputation: 798804
You want to implement a file-based merge sort. Read a line from both files, output the older line, then read another line from that file. Once one of the files is exhausted, output all the remaining lines from the other file.
Upvotes: 1