user1877600

Reputation: 647

String similarity between two vectors of words

I have two very long lists of words (~100k and ~500k entries) and I need to find all similar pairs. My solution has a time complexity of O(n*m). Is there a way to optimize this algorithm and reduce its complexity?

import difflib

def are_similar(first, second):
    threshold = 0.88
    return difflib.SequenceMatcher(a=first.lower(), b=second.lower()).ratio() > threshold


list_1 = ["123456","23456",  ...] # len(list_1) ~ 100k
list_2 =["123123","asda2131", ...] # len(list_2)~ 500k

similar = []
for element_list1 in list_1:
    for element_list2 in list_2:
        if are_similar(element_list1,element_list2 ):
            similar.append((element_list1,element_list2 ))

print (similar)
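
For context, SequenceMatcher.ratio() returns 2*M/T, where M is the number of matched characters and T is the combined length of both strings, so the 0.88 threshold only passes near-identical strings. A couple of illustrative values:

# "1234567" vs "123456" share a 6-char matching block: 2*6 / (7+6) ≈ 0.923 -> similar.
print(are_similar("1234567", "123456"))  # True
# "kitten" vs "sitten" share 5 chars: 2*5 / (6+6) ≈ 0.833 -> not similar.
print(are_similar("kitten", "sitten"))   # False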

What is the best way to parallelize the above code? My current implementation, not included here, uses multiprocessing.Pool over the first loop.
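
Roughly, a minimal sketch of that Pool-over-the-first-loop approach (the helper name matches_for and the chunksize value are illustrative, not my exact code):

from multiprocessing import Pool

def matches_for(element_list1):
    # One row of the comparison grid: element_list1 against all of list_2.
    return [(element_list1, element_list2)
            for element_list2 in list_2
            if are_similar(element_list1, element_list2)]

if __name__ == "__main__":
    with Pool() as pool:
        # chunksize batches the ~100k tasks to keep IPC overhead low.
        chunks = pool.map(matches_for, list_1, chunksize=100)
    similar = [pair for chunk in chunks for pair in chunk]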

Upvotes: 2

Views: 325

Answers (1)

Hari_pb

Reputation: 7416

I can suggest another approach, but I am not sure it is exactly what you want. First, there are two lists; if we compare a word with itself, the similarity is 1, i.e. an exact match, so we only need to start comparing from the next word. Now, let's gather all the words into a single list and take a set of it to remove duplicates.

list_1 = ["123456","23456",  ...] # len(list_1) ~ 100k
list_2 =["123123","asda2131", ...] # len(list_2)~ 500k


list_3 = list_1 + list_2
list_3 = list(set(list_3))  # merge both lists and keep only unique words

similar = []
# Compare each word only with the next one in the deduplicated list,
# stopping before the last index to avoid an IndexError on list_3[i + 1].
for i in range(len(list_3) - 1):
    if are_similar(list_3[i], list_3[i + 1]):
        similar.append((list_3[i], list_3[i + 1]))

print(similar)

I compared over the set of the merged lists because otherwise we would compare exactly the same words again and again; deduplicating first significantly reduces the number of comparisons when there are repeated words. The complexity of this method is O(n), since each word is compared only with the word right after it. I hope this helps.
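
For a quick illustration on toy inputs (the values here are made up), the set step shrinks the work before the single pass:

list_1 = ["123456", "23456", "123456"]    # "123456" appears twice
list_2 = ["123123", "asda2131", "23456"]  # "23456" is also in list_1

list_3 = list(set(list_1 + list_2))       # 4 unique words instead of 6
# Single pass: len(list_3) - 1 == 3 comparisons,
# versus len(list_1) * len(list_2) == 9 for the nested loops.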

Upvotes: 1
