Reputation: 1788
We have a long list of strings (approx. 18k entries). The goal is to find all similar strings and to group them by maximum similarity ("a" is the list of strings).
I have written the following code:
import difflib

def diff(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

dupl = {}
while len(a) > 0:
    k = a.pop()                      # take one string and compare it against the rest
    if k not in dupl.keys():
        dupl[k] = []
    for i, j in enumerate(a):
        dif = diff(k, j)
        if dif > 0.5:
            dupl[k].append("{0}: {1}".format(dif, j))
This code takes an element from the list and searches for duplicates in the rest of the list. If the similarity is more than 0.5, the similar string is added to the dict.
Everything works correctly, but very, very slowly because of the length of the list "a". So I would like to ask: is there a way to optimize this code somehow? Any ideas?
Upvotes: 3
Views: 689
Reputation: 33509
A couple of small optimisations:
You could remove duplicates from the list before starting the search (e.g. a=list(set(a))). At the moment, if a contains 18k copies of the string 'hello' it will call diff 18k*18k times.
Currently you will be comparing string number i with string number j, and also string number j with string number i. I think these will return the same result, so you could compute only one of them and perhaps go twice as fast (see the sketch below).
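For example, a minimal sketch combining both ideas (deduplicating first, then using itertools.combinations so each unordered pair is compared only once; the function name and output format are just illustrative):

import difflib
import itertools

def diff(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

def find_similar(strings, threshold=0.5):
    unique = list(set(strings))                      # drop exact duplicates first
    dupl = {s: [] for s in unique}
    for s, t in itertools.combinations(unique, 2):   # each unordered pair only once
        d = diff(s, t)
        if d > threshold:
            dupl[s].append("{0}: {1}".format(d, t))  # recorded under s only
    return dupl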
Of course, the basic problem is that diff is being called n*n times for a list of length n and an ideal solution would be to reduce the number of times diff is being called. The approach to use will depend on the content of your strings.
Here are a few examples of possible approaches that would be relevant to different cases:
Suppose the strings are of very different lengths. Because ratio() is at most 2*min_len/(min_len+max_len), diff can only return >0.5 if the longer string is less than 3 times the length of the shorter one. In this case you could sort the input strings by length in O(n log n) time, and then only compare strings with similar lengths.
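A possible sketch of that idea (my own code, not part of the answer; the early break relies on the bound ratio() <= 2*min_len/(min_len+max_len)):

import difflib

def diff(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

def similar_by_length(strings, threshold=0.5):
    strings = sorted(set(strings), key=len)          # shortest first
    dupl = {s: [] for s in strings}
    for i, s in enumerate(strings):
        for t in strings[i + 1:]:
            # ratio() <= 2*len(s)/(len(s)+len(t)); once t is too long,
            # every later string is too long as well, so stop early.
            if 2 * len(s) < threshold * (len(s) + len(t)):
                break
            d = diff(s, t)
            if d > threshold:
                dupl[s].append("{0}: {1}".format(d, t))
    return dupl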
Suppose the strings are sequences of words and are expected to be either very different or very similar. You could construct an inverted index of the words and then only compare strings which contain the same unusual words.
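A rough sketch of the inverted-index idea (illustrative only; a real implementation would skip very common words so that the posting lists stay short):

import difflib
import itertools
from collections import defaultdict

def diff(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

def similar_by_shared_words(strings, threshold=0.5):
    strings = list(set(strings))
    index = defaultdict(set)                         # word -> indices of strings containing it
    for i, s in enumerate(strings):
        for word in set(s.split()):
            index[word].add(i)
    dupl = defaultdict(list)
    seen = set()
    for postings in index.values():                  # a real version would skip very common words
        for i, j in itertools.combinations(sorted(postings), 2):
            if (i, j) in seen:                       # the same pair may share several words
                continue
            seen.add((i, j))
            d = diff(strings[i], strings[j])
            if d > threshold:
                dupl[strings[i]].append("{0}: {1}".format(d, strings[j]))
    return dupl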
Suppose you expect the strings to fall into a small number of groups. You could try running a K-means algorithm to group them into clusters first. This would take roughly K*n*I operations, where I is the number of iterations of the K-means algorithm you choose to use.
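A sketch of the clustering idea, assuming the third-party scikit-learn package and a character n-gram TF-IDF vectorisation; both of those choices are mine rather than part of the answer, and diff() is still used for the final comparisons inside each cluster:

import difflib
import itertools
from collections import defaultdict
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def diff(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

def similar_by_clustering(strings, n_clusters=50, threshold=0.5):
    strings = list(set(strings))
    vectors = TfidfVectorizer(analyzer='char', ngram_range=(2, 3)).fit_transform(strings)
    labels = KMeans(n_clusters=n_clusters).fit_predict(vectors)
    clusters = defaultdict(list)
    for label, s in zip(labels, strings):
        clusters[label].append(s)
    dupl = defaultdict(list)
    for members in clusters.values():                # only compare strings within a cluster
        for s, t in itertools.combinations(members, 2):
            d = diff(s, t)
            if d > threshold:
                dupl[s].append("{0}: {1}".format(d, t))
    return dupl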
If n grows to be very large (many millions), then these approaches will not be appropriate and you will probably need to use more approximate techniques. One example that is used for clustering web pages is called MinHash.
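For illustration only, a sketch of the MinHash approach using the third-party datasketch package (not mentioned in the answer); note that the LSH threshold here is Jaccard similarity on word sets, which is only a rough proxy for difflib's ratio(), so candidates are re-checked with diff():

import difflib
from datasketch import MinHash, MinHashLSH

def diff(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

def make_minhash(s, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for word in set(s.split()):
        m.update(word.encode('utf8'))
    return m

def similar_by_minhash(strings, threshold=0.5):
    strings = list(set(strings))
    hashes = [make_minhash(s) for s in strings]
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    for i, h in enumerate(hashes):
        lsh.insert(str(i), h)
    dupl = {s: [] for s in strings}
    for i, s in enumerate(strings):
        for key in lsh.query(hashes[i]):             # candidate near-duplicates only
            j = int(key)
            if j <= i:                               # skip self and already-handled pairs
                continue
            d = diff(s, strings[j])
            if d > threshold:
                dupl[s].append("{0}: {1}".format(d, strings[j]))
    return dupl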
Upvotes: 2
Reputation: 3427
When you need to iterate over many items, itertools to the rescue!
This snippet will generate all permutations of your strings and return the ratios in the same fashion your original code did. I feel the not in check is a needlessly expensive way to test membership and not as Pythonic. Permutations was chosen because it gives you the most direct access, letting you look up both a->b and b->a for any two given strings.
import difflib
import itertools

def diff(a, b):
    return difflib.SequenceMatcher(None, a, b).quick_ratio()

def calculate_ratios(strings):
    dupl = dict()
    for s, t in itertools.permutations(strings, 2):   # every ordered pair with s != t
        try:
            dupl[s].append({t: diff(s, t)})
        except KeyError:
            dupl[s] = []
            dupl[s].append({t: diff(s, t)})
    return dupl

a = ['first string', 'second string', 'third string', 'fourth string']
print(calculate_ratios(a))
Depending on your constraints (permutations are redundant both computationally and space-wise), you can replace permutations with combinations, but then your accessing method will need to be adjusted, since the ratio for a pair a, b will only be listed under dupl[a] and not under dupl[b]. For example:
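A combinations-based variant might look like this (the function name is mine; each pair is stored under the first string only):

import difflib
import itertools

def diff(a, b):
    return difflib.SequenceMatcher(None, a, b).quick_ratio()

def calculate_ratios_combinations(strings):
    dupl = dict()
    for s, t in itertools.combinations(strings, 2):
        # each unordered pair appears once and is stored under the first string only
        dupl.setdefault(s, []).append({t: diff(s, t)})
    return dupl

a = ['first string', 'second string', 'third string', 'fourth string']
print(calculate_ratios_combinations(a))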
In the code I use quick_ratio(), but it is just as simple to change it to ratio() or real_quick_ratio(), depending on whether the precision is good enough for you. If it is not, a simple if lets you use the cheap quick_ratio() as a first filter and the precise ratio() only for the pairs that pass it:
import difflib
import itertools

def diff(a, b):
    return difflib.SequenceMatcher(None, a, b).quick_ratio()

def diff2(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

def calculate_ratios(strings, threshold):
    dupl = dict()
    for s, t in itertools.permutations(strings, 2):
        if diff(s, t) > threshold:                    # cheap filter, arbitrary threshold
            try:
                dupl[s].append({t: diff2(s, t)})
            except KeyError:
                dupl[s] = []
                dupl[s].append({t: diff2(s, t)})
    return dupl

a = ['first string', 'second string', 'third string', 'fourth string']
print(calculate_ratios(a, 0.5))
Upvotes: 2