user1877600

Reputation: 647

String similarity between two vectors of words

I have two very long lists of words (~100k and ~500k entries) and I need to find all similar pairs. My solution has a time complexity of O(n*m). Is there a way to optimize this algorithm and reduce its complexity?

import difflib

def are_similar(first, second):
    threshold = 0.88
    return difflib.SequenceMatcher(a=first.lower(), b=second.lower()).ratio() > threshold


list_1 = ["123456","23456",  ...] # len(list_1) ~ 100k
list_2 =["123123","asda2131", ...] # len(list_2)~ 500k

similar = []
for element_list1 in list_1:
    for element_list2 in list_2:
        if are_similar(element_list1,element_list2 ):
            similar.append((element_list1,element_list2 ))

print (similar)
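
For context, SequenceMatcher.ratio() returns 2*M/T, where M is the number of matched characters and T is the combined length of both strings, so the 0.88 threshold only passes near-identical strings. A couple of illustrative values:

# "1234567" vs "123456" share a 6-char matching block: 2*6 / (7+6) ≈ 0.923 -> similar.
print(are_similar("1234567", "123456"))  # True
# "kitten" vs "sitten" share 5 chars: 2*5 / (6+6) ≈ 0.833 -> not similar.
print(are_similar("kitten", "sitten"))   # False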

What is the best way to parallelize the above code? My current implementation, not included here, uses multiprocessing.Pool over the first loop.
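
Roughly, a minimal sketch of that Pool-over-the-first-loop approach (the helper name matches_for and the chunksize value are illustrative, not my exact code):

from multiprocessing import Pool

def matches_for(element_list1):
    # One row of the comparison grid: element_list1 against all of list_2.
    return [(element_list1, element_list2)
            for element_list2 in list_2
            if are_similar(element_list1, element_list2)]

if __name__ == "__main__":
    with Pool() as pool:
        # chunksize batches the ~100k tasks to keep IPC overhead low.
        chunks = pool.map(matches_for, list_1, chunksize=100)
    similar = [pair for chunk in chunks for pair in chunk]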

Upvotes: 2

Views: 325

Answers (1)

Hari_pb

Reputation: 7416

I can suggest another approach, but I am not sure it is exactly what you want. First, there are two lists; if we compare a word with itself, the similarity is 1, i.e. an exact match, so we only need to start comparing from the next word. Now, let's gather all the words into a single list and take a set of it to remove duplicates.

list_1 = ["123456","23456",  ...] # len(list_1) ~ 100k
list_2 =["123123","asda2131", ...] # len(list_2)~ 500k


list_3 = list_1 + list_2
list_3 = list(set(list_3))  # merge both lists and keep only unique words

similar = []
# Compare each word only with the next one in the deduplicated list,
# stopping before the last index to avoid an IndexError on list_3[i + 1].
for i in range(len(list_3) - 1):
    if are_similar(list_3[i], list_3[i + 1]):
        similar.append((list_3[i], list_3[i + 1]))

print(similar)

I compared over the set of the merged lists because otherwise we would compare exactly the same words again and again; deduplicating first significantly reduces the number of comparisons when there are repeated words. The complexity of this method is O(n), since each word is compared only with the word right after it. I hope this helps.
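
For a quick illustration on toy inputs (the values here are made up), the set step shrinks the work before the single pass:

list_1 = ["123456", "23456", "123456"]    # "123456" appears twice
list_2 = ["123123", "asda2131", "23456"]  # "23456" is also in list_1

list_3 = list(set(list_1 + list_2))       # 4 unique words instead of 6
# Single pass: len(list_3) - 1 == 3 comparisons,
# versus len(list_1) * len(list_2) == 9 for the nested loops.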

Upvotes: 1
