Reputation: 1788
We have a long list of strings (approx. 18k entries). The goal is to find all similar strings and to group them by maximum similarity ("a" is the list of strings).
I have written the following code:
import difflib

def diff(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

dupl = {}
while len(a) > 0:
    k = a.pop()                      # take one string and compare it against the rest
    if k not in dupl.keys():
        dupl[k] = []
    for i, j in enumerate(a):
        dif = diff(k, j)
        if dif > 0.5:
            dupl[k].append("{0}: {1}".format(dif, j))
This code takes an element from the list and searches for duplicates in the rest of the list. If the similarity is more than 0.5, the similar string is added to the dict.
Everything works correctly, but very, very slowly because of the length of the list "a". So I would like to ask: is there a way to optimize this code somehow? Any ideas?
Upvotes: 3
Views: 689
Reputation: 33509
A couple of small optimisations:
You could remove duplicates from the list before starting the search (e.g. a=list(set(a))). At the moment, if a contains 18k copies of the string 'hello' it will call diff 18k*18k times.
Currently you will be comparing string number i with string number j, and also string number j with string number i. I think these will return the same result, so you could compute only one of them and perhaps go twice as fast (see the sketch below).
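For example, a minimal sketch combining both ideas (deduplicating first, then using itertools.combinations so each unordered pair is compared only once; the function name and output format are just illustrative):

import difflib
import itertools

def diff(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

def find_similar(strings, threshold=0.5):
    unique = list(set(strings))                      # drop exact duplicates first
    dupl = {s: [] for s in unique}
    for s, t in itertools.combinations(unique, 2):   # each unordered pair only once
        d = diff(s, t)
        if d > threshold:
            dupl[s].append("{0}: {1}".format(d, t))  # recorded under s only
    return dupl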
Of course, the basic problem is that diff is being called n*n times for a list of length n and an ideal solution would be to reduce the number of times diff is being called. The approach to use will depend on the content of your strings.
Here are a few examples of possible approaches that would be relevant to different cases:
Suppose the strings are of very different lengths. Because ratio() is at most 2*min_len/(min_len+max_len), diff can only return >0.5 if the longer string is less than 3 times the length of the shorter one. In this case you could sort the input strings by length in O(n log n) time, and then only compare strings with similar lengths.
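A possible sketch of that idea (my own code, not part of the answer; the early break relies on the bound ratio() <= 2*min_len/(min_len+max_len)):

import difflib

def diff(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

def similar_by_length(strings, threshold=0.5):
    strings = sorted(set(strings), key=len)          # shortest first
    dupl = {s: [] for s in strings}
    for i, s in enumerate(strings):
        for t in strings[i + 1:]:
            # ratio() <= 2*len(s)/(len(s)+len(t)); once t is too long,
            # every later string is too long as well, so stop early.
            if 2 * len(s) < threshold * (len(s) + len(t)):
                break
            d = diff(s, t)
            if d > threshold:
                dupl[s].append("{0}: {1}".format(d, t))
    return dupl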
Suppose the strings are sequences of words and are expected to be either very different or very similar. You could construct an inverted index of the words and then only compare strings which contain the same unusual words.
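A rough sketch of the inverted-index idea (illustrative only; a real implementation would skip very common words so that the posting lists stay short):

import difflib
import itertools
from collections import defaultdict

def diff(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

def similar_by_shared_words(strings, threshold=0.5):
    strings = list(set(strings))
    index = defaultdict(set)                         # word -> indices of strings containing it
    for i, s in enumerate(strings):
        for word in set(s.split()):
            index[word].add(i)
    dupl = defaultdict(list)
    seen = set()
    for postings in index.values():                  # a real version would skip very common words
        for i, j in itertools.combinations(sorted(postings), 2):
            if (i, j) in seen:                       # the same pair may share several words
                continue
            seen.add((i, j))
            d = diff(strings[i], strings[j])
            if d > threshold:
                dupl[strings[i]].append("{0}: {1}".format(d, strings[j]))
    return dupl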
Suppose you expect the strings to fall into a small number of groups. You could try running a K-means algorithm to group them into clusters first. This would take roughly K*n*I operations, where I is the number of iterations of the K-means algorithm you choose to use.
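A sketch of the clustering idea, assuming the third-party scikit-learn package and a character n-gram TF-IDF vectorisation; both of those choices are mine rather than part of the answer, and diff() is still used for the final comparisons inside each cluster:

import difflib
import itertools
from collections import defaultdict
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def diff(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

def similar_by_clustering(strings, n_clusters=50, threshold=0.5):
    strings = list(set(strings))
    vectors = TfidfVectorizer(analyzer='char', ngram_range=(2, 3)).fit_transform(strings)
    labels = KMeans(n_clusters=n_clusters).fit_predict(vectors)
    clusters = defaultdict(list)
    for label, s in zip(labels, strings):
        clusters[label].append(s)
    dupl = defaultdict(list)
    for members in clusters.values():                # only compare strings within a cluster
        for s, t in itertools.combinations(members, 2):
            d = diff(s, t)
            if d > threshold:
                dupl[s].append("{0}: {1}".format(d, t))
    return dupl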
If n grows to be very large (many millions), then these approaches will not be appropriate and you will probably need to use more approximate techniques. One example that is used for clustering web pages is called MinHash.
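For illustration only, a sketch of the MinHash approach using the third-party datasketch package (not mentioned in the answer); note that the LSH threshold here is Jaccard similarity on word sets, which is only a rough proxy for difflib's ratio(), so candidates are re-checked with diff():

import difflib
from datasketch import MinHash, MinHashLSH

def diff(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

def make_minhash(s, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for word in set(s.split()):
        m.update(word.encode('utf8'))
    return m

def similar_by_minhash(strings, threshold=0.5):
    strings = list(set(strings))
    hashes = [make_minhash(s) for s in strings]
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    for i, h in enumerate(hashes):
        lsh.insert(str(i), h)
    dupl = {s: [] for s in strings}
    for i, s in enumerate(strings):
        for key in lsh.query(hashes[i]):             # candidate near-duplicates only
            j = int(key)
            if j <= i:                               # skip self and already-handled pairs
                continue
            d = diff(s, strings[j])
            if d > threshold:
                dupl[s].append("{0}: {1}".format(d, strings[j]))
    return dupl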
Upvotes: 2
Reputation: 3427
When you need to iterate over many items, itertools to the rescue!
This snippet will generate all permutations of your strings and return the ratios in the same fashion your original code did. I feel the not in check is a needlessly expensive way to test membership and not as Pythonic. Permutations was chosen because it gives you the most direct access, letting you look up both a->b and b->a for any two given strings.
import difflib
import itertools

def diff(a, b):
    return difflib.SequenceMatcher(None, a, b).quick_ratio()

def calculate_ratios(strings):
    dupl = dict()
    for s, t in itertools.permutations(strings, 2):   # every ordered pair with s != t
        try:
            dupl[s].append({t: diff(s, t)})
        except KeyError:
            dupl[s] = []
            dupl[s].append({t: diff(s, t)})
    return dupl

a = ['first string', 'second string', 'third string', 'fourth string']
print(calculate_ratios(a))
Depending on your constraints (permutations are redundant both computationally and space-wise), you can replace permutations with combinations, but then your accessing method will need to be adjusted, since the ratio for a pair a, b will only be listed under dupl[a] and not under dupl[b]. For example:
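A combinations-based variant might look like this (the function name is mine; each pair is stored under the first string only):

import difflib
import itertools

def diff(a, b):
    return difflib.SequenceMatcher(None, a, b).quick_ratio()

def calculate_ratios_combinations(strings):
    dupl = dict()
    for s, t in itertools.combinations(strings, 2):
        # each unordered pair appears once and is stored under the first string only
        dupl.setdefault(s, []).append({t: diff(s, t)})
    return dupl

a = ['first string', 'second string', 'third string', 'fourth string']
print(calculate_ratios_combinations(a))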
In the code I use quick_ratio(), but it is just as simple to change it to ratio() or real_quick_ratio(), depending on whether the precision is good enough for you. If it is not, a simple if lets you use the cheap quick_ratio() as a first filter and the precise ratio() only for the pairs that pass it:
import difflib
import itertools

def diff(a, b):
    return difflib.SequenceMatcher(None, a, b).quick_ratio()

def diff2(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

def calculate_ratios(strings, threshold):
    dupl = dict()
    for s, t in itertools.permutations(strings, 2):
        if diff(s, t) > threshold:                    # cheap filter, arbitrary threshold
            try:
                dupl[s].append({t: diff2(s, t)})
            except KeyError:
                dupl[s] = []
                dupl[s].append({t: diff2(s, t)})
    return dupl

a = ['first string', 'second string', 'third string', 'fourth string']
print(calculate_ratios(a, 0.5))
Upvotes: 2