Reputation: 5927
I have a list of N strings. My task is to find all pairs of strings that are sufficiently similar. That is, I need (i) a similarity metric that would produce a number in a predefined range (say between 0 and 1) that measures how similar the two strings are and (ii) a way of going through O(N^2) pairs quickly to find those that are above some sort of threshold (say >= 0.9 if the metric gives larger numbers for more similar strings). What I am doing now is pretty slow (as one might expect) for a large N:
import difflib
num_strings = len(my_strings)
for i in range(num_strings):
s_i = my_strings[i]
for j in range(i+1,num_strings):
s_j = my_strings[j]
sim = difflib.SequenceMatcher(a=s_i, b=s_j).ratio()
if sim >= thresh:
print("%s\t%s\t%f" % (s_i,s_j,sim))
Questions:
Thank you
Upvotes: 2
Views: 99
Reputation: 86
If you want the optimal solution you have to be O(n^2), if you want an approximate of the optimal solution you can select a threshold and delete pairs that have a fair similarity ratio. I would suggest you use another metrics since you're adding complexity with the difflib's ratio (it depends on the length of the strings). These ratio could be entropy or manhattan/euclidean distance.
Upvotes: 1