I Z

Reputation: 5927

python3 (nltk/numpy/etc): ISO efficient way to find pairs of similar strings

I have a list of N strings. My task is to find all pairs of strings that are sufficiently similar. That is, I need (i) a similarity metric that would produce a number in a predefined range (say between 0 and 1) that measures how similar the two strings are and (ii) a way of going through O(N^2) pairs quickly to find those that are above some sort of threshold (say >= 0.9 if the metric gives larger numbers for more similar strings). What I am doing now is pretty slow (as one might expect) for a large N:

import difflib

num_strings = len(my_strings)
for i in range(num_strings):
    s_i = my_strings[i]

    # Compare s_i against every later string so each unordered pair
    # is examined exactly once (still O(N^2) comparisons overall).
    for j in range(i + 1, num_strings):
        s_j = my_strings[j]
        # ratio() returns a similarity in [0, 1]; 1.0 means identical.
        sim = difflib.SequenceMatcher(a=s_i, b=s_j).ratio()
        if sim >= thresh:
            print("%s\t%s\t%f" % (s_i, s_j, sim))

Questions:

  1. What would be a good way of vectorizing this double loop to speed it up, perhaps using NLTK, numpy, or another library?
  2. Would you recommend a better metric than difflib's ratio (again, from NLTK, numpy, etc.)?

Thank you

Upvotes: 2

Views: 99

Answers (1)

Merkle Daamgard

Reputation: 86

If you want the optimal solution, you have to do O(n^2) comparisons; if an approximation of the optimal solution is acceptable, you can choose a threshold and discard pairs with a poor rough similarity early, before running the expensive comparison. I would also suggest using a different metric, since difflib's ratio adds complexity (its cost depends on the lengths of the strings). That cheaper metric could be entropy or a Manhattan/Euclidean distance, as sketched below.
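
Here is a minimal sketch of that pruning idea, assuming `my_strings` and `thresh` are defined as in the question and taking the Manhattan distance between character-count vectors (one plausible reading of the suggestion; the names `counts`, `dist`, and `lengths` are illustrative). Since ratio() counts matching characters, it can be at most 1 - d / (len_i + len_j) when the two count vectors are Manhattan distance d apart, so the cheap test below never discards a pair whose ratio would reach thresh.

import difflib
import numpy as np
from scipy.spatial.distance import pdist, squareform

my_strings = ["apple", "appel", "banana", "bananna", "cherry"]
thresh = 0.9

# Cheap proxy: represent each string by its character counts.
counts = np.zeros((len(my_strings), 256), dtype=np.int32)
for i, s in enumerate(my_strings):
    for ch in s:
        counts[i, ord(ch) % 256] += 1

# Manhattan (cityblock) distances between all pairs of count vectors,
# computed in a single vectorized call.
dist = squareform(pdist(counts, metric="cityblock"))
lengths = np.array([len(s) for s in my_strings])

for i in range(len(my_strings)):
    for j in range(i + 1, len(my_strings)):
        # ratio() is bounded above by 1 - d / (len_i + len_j), so any
        # pair failing this cheap test cannot reach thresh.
        if dist[i, j] > (1.0 - thresh) * (lengths[i] + lengths[j]):
            continue
        sim = difflib.SequenceMatcher(a=my_strings[i], b=my_strings[j]).ratio()
        if sim >= thresh:
            print("%s\t%s\t%f" % (my_strings[i], my_strings[j], sim))

The expensive difflib comparison now runs only on the surviving candidate pairs; with a high threshold, the pre-filter typically eliminates the vast majority of the O(n^2) pairs at vectorized-numpy cost.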

Upvotes: 1
