I Z

Reputation: 5927

python3 (nltk/numpy/etc): ISO efficient way to find pairs of similar strings

I have a list of N strings. My task is to find all pairs of strings that are sufficiently similar. That is, I need (i) a similarity metric that would produce a number in a predefined range (say between 0 and 1) that measures how similar the two strings are and (ii) a way of going through O(N^2) pairs quickly to find those that are above some sort of threshold (say >= 0.9 if the metric gives larger numbers for more similar strings). What I am doing now is pretty slow (as one might expect) for a large N:

import difflib

num_strings = len(my_strings)
for i in range(num_strings):
    s_i = my_strings[i]

    # Compare s_i against every later string so each unordered pair
    # is examined exactly once (still O(N^2) comparisons overall).
    for j in range(i + 1, num_strings):
        s_j = my_strings[j]
        # ratio() returns a similarity in [0, 1]; 1.0 means identical.
        sim = difflib.SequenceMatcher(a=s_i, b=s_j).ratio()
        if sim >= thresh:
            print("%s\t%s\t%f" % (s_i, s_j, sim))

Questions:

  1. What would be a good way of vectorizing this double loop to speed it up, perhaps using NLTK, numpy, or another library?
  2. Would you recommend a better metric than difflib's ratio (again, from NLTK, numpy, etc.)?

Thank you

Upvotes: 2

Views: 99

Answers (1)

Merkle Daamgard

Reputation: 86

If you want the optimal solution, you have to do O(n^2) comparisons; if an approximation of the optimal solution is acceptable, you can choose a threshold and discard pairs with a poor rough similarity early, before running the expensive comparison. I would also suggest using a different metric, since difflib's ratio adds complexity (its cost depends on the lengths of the strings). That cheaper metric could be entropy or a Manhattan/Euclidean distance, as sketched below.
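
Here is a minimal sketch of that pruning idea, assuming `my_strings` and `thresh` are defined as in the question and taking the Manhattan distance between character-count vectors (one plausible reading of the suggestion; the names `counts`, `dist`, and `lengths` are illustrative). Since ratio() counts matching characters, it can be at most 1 - d / (len_i + len_j) when the two count vectors are Manhattan distance d apart, so the cheap test below never discards a pair whose ratio would reach thresh.

import difflib
import numpy as np
from scipy.spatial.distance import pdist, squareform

my_strings = ["apple", "appel", "banana", "bananna", "cherry"]
thresh = 0.9

# Cheap proxy: represent each string by its character counts.
counts = np.zeros((len(my_strings), 256), dtype=np.int32)
for i, s in enumerate(my_strings):
    for ch in s:
        counts[i, ord(ch) % 256] += 1

# Manhattan (cityblock) distances between all pairs of count vectors,
# computed in a single vectorized call.
dist = squareform(pdist(counts, metric="cityblock"))
lengths = np.array([len(s) for s in my_strings])

for i in range(len(my_strings)):
    for j in range(i + 1, len(my_strings)):
        # ratio() is bounded above by 1 - d / (len_i + len_j), so any
        # pair failing this cheap test cannot reach thresh.
        if dist[i, j] > (1.0 - thresh) * (lengths[i] + lengths[j]):
            continue
        sim = difflib.SequenceMatcher(a=my_strings[i], b=my_strings[j]).ratio()
        if sim >= thresh:
            print("%s\t%s\t%f" % (my_strings[i], my_strings[j], sim))

The expensive difflib comparison now runs only on the surviving candidate pairs; with a high threshold, the pre-filter typically eliminates the vast majority of the O(n^2) pairs at vectorized-numpy cost.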

Upvotes: 1
