Briana

Reputation: 40

Optimizing removing duplicates in large files in Python

I have one very large text file (27 GB) that I am attempting to make smaller by removing lines that are duplicated in a second database consisting of several files of a more reasonable size (500 MB to 2 GB). I have some functional code; what I am wondering is whether there is any way to optimize it to run faster in wall-clock time. At the moment, on a small test run with a 1.5 GB input and a 500 MB filter, this takes about 75 seconds to complete.

I've gone through many iterations of this idea, and the version below is currently the best for time; if anyone has an idea for a better logical structure for the filter I'd love to hear it. Past attempts, all of which have been worse than this one:

  • Loading the filter into a set and cycling through the input searching for duplicates (about half as fast as this; sketched below).
  • Loading the input into a set and running the filter through a difference_update (almost as fast as this, but doing the reverse of what I wanted).
  • Loading both input and filter into sets in chunks and taking set differences (a horrible, horrible idea that might have worked if my filters were small enough that I didn't have to split them).
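
For reference, a minimal sketch of that first, set-based attempt might look like this (function and file names are just placeholders; assumes Python 2 and that the whole filter fits comfortably in memory):

def setfilter(input_name, filter_name, output_name):
    # Sketch of the set-based attempt; names are placeholders.
    # Load the entire filter file into a set of lines, then stream the
    # input through it.
    with open(filter_name, 'r') as file_filter:
        filter_lines = set(file_filter)
    with open(input_name, 'r') as file_input, open(output_name, 'w') as file_output:
        for line in file_input:
            # Keep only lines that do not appear in the filter.
            if line not in filter_lines:
                file_output.write(line)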

So, those are all the things I've tried. All of these approaches max out the CPU, and my final version runs at about 25-50% disk I/O; the filter and output are on one physical disk, and the input is on another. I am running a dual-core machine and have no idea whether this particular script can be threaded; I've never done any multithreading before, so if that's a possibility I'd love to be pointed in the right direction.

Some information about the data: as said earlier, the input is many times larger than the filter, and I am expecting a very small percentage of duplication. The data is line-oriented, every line is under 20 ASCII characters long, and all of the files are sorted.

I've already reordered the three logical branches based on the expectation that unique input lines will make up the majority, followed by unique filter lines, then duplicates; in the 'best' case of having no duplicates at all, that saved me about 10% of the time.

Any suggestions?

def sortedfilter(input, filter, output):
    # Walk both sorted files in step, writing out input lines that do not
    # appear in the filter file.
    file_input = open(input, 'r')
    file_filter = open(filter, 'r')
    file_output = open(output, 'w')
    inline = file_input.next()
    filterline = file_filter.next()
    try:
        while inline and filterline:
            if inline < filterline:
                # Input line is not in the filter: keep it.
                file_output.write(inline)
                inline = file_input.next()
                continue
            if inline > filterline:
                # Filter line is not in the input: skip ahead in the filter.
                filterline = file_filter.next()
                continue
            if inline == filterline:
                # Duplicate: drop the input line and advance both files.
                filterline = file_filter.next()
                inline = file_input.next()
    except StopIteration:
        # One of the files ran out; pass the rest of the input through.
        file_output.writelines(file_input.readlines())
    finally:
        file_filter.close()
        file_input.close()
        file_output.close()

Upvotes: 1

Views: 1529

Answers (4)

Marco de Wit

Reputation: 2804

Consider opening your files in binary mode so that no conversion to Unicode needs to be made:

with open(in_fname, 'rb') as inf, open(filter_fname, 'rb') as fil, open(out_fname, 'wb') as outf:
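
For example, a sketch of the question's function with binary-mode files (the with-statement and the built-in next(f, '') used here for end-of-file handling are simplifications, not part of the original code):

def sortedfilter(in_fname, filter_fname, out_fname):
    # Sketch only: same merge idea as the question, but the files are opened
    # in binary mode and next(f, '') returns '' at end-of-file instead of
    # raising StopIteration.
    with open(in_fname, 'rb') as inf, open(filter_fname, 'rb') as fil, open(out_fname, 'wb') as outf:
        inline = next(inf, '')
        filterline = next(fil, '')
        while inline:                        # stop once the input is exhausted
            if not filterline or inline < filterline:
                outf.write(inline)           # unique input line: keep it
                inline = next(inf, '')
            elif inline == filterline:
                inline = next(inf, '')       # duplicate: drop the input line
                filterline = next(fil, '')
            else:
                filterline = next(fil, '')   # filter-only line: ignore it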

Upvotes: 0

iruvar

Reputation: 23364

Seeing that your input files are sorted, the following should work. heapq.merge produces a single sorted stream from the sorted inputs, and groupby then operates on that stream; groups longer than one line (lines that appear in both files) are discarded, as are lines that came only from the filter. This stream-based approach should have relatively low memory requirements.

from itertools import groupby, repeat, izip
from heapq import merge
from operator import itemgetter

with open('input.txt', 'r') as file_input, open('filter.txt', 'r') as file_filter, open('output.txt', 'w') as file_output:
  # Tag each line with its source: 1 for the input file, 2 for the filter.
  file_input_1 = izip(file_input, repeat(1))
  file_filter_1 = izip(file_filter, repeat(2))
  # merge() combines the two sorted streams into one sorted stream of
  # (line, source) pairs.
  gen = merge(file_input_1, file_filter_1)
  # Group consecutive pairs that share the same line.
  gen = ((k, list(g)) for (k, g) in groupby(gen, key=itemgetter(0)))
  # Keep only lines that occur exactly once and came from the input file.
  gen = (k for (k, g) in gen if len(g) == 1 and g[0][1] == 1)
  for line in gen:
    file_output.write(line)

Upvotes: 1

Hugh Bothwell

Reputation: 56634

I'd be interested in knowing how this compares; it's basically just playing with the order of iteration a bit.

def sortedfilter(in_fname, filter_fname, out_fname):
    with open(in_fname) as inf, open(filter_fname) as fil, open(out_fname, 'w') as outf:
        ins = inf.next()
        try:
            for fs in fil:
                while ins < fs:
                    outf.write(ins)
                    ins = inf.next()
                while ins == fs:
                    ins = inf.next()
        except StopIteration:
            # reached end of inf before end of fil
            pass
        else:
            # reached end of fil first; write the held line, then pass the rest of inf through
            outf.write(ins)
            outf.writelines(inf)

Upvotes: 1

Alex Wilson

Reputation: 6740

You can perform the string comparison only once per pair of lines by calling cmp(inline, filterline):

  • -1 means inline < filterline
  • 0 means inline == filterline
  • +1 means inline > filterline.

This might get you a few extra percent. So something like:

    while inline and filterline:
        comparison = cmp(inline, filterline)
        if comparison == -1:
            file_output.write(inline)
            inline = file_input.next()
            continue
        if comparison == 1:
            filterline = file_filter.next()
            continue
        if comparison == 0:
            filterline = file_filter.next()
            inline = file_input.next()

Upvotes: 1
