Reputation: 647
I have two very long lists of words (on the order of 100k entries) and I need to find all similar pairs. My solution has a time complexity of O(n*m). Is there a way to optimize this algorithm and reduce its complexity?
import difflib

def are_similar(first, second):
    threshold = 0.88
    return difflib.SequenceMatcher(a=first.lower(), b=second.lower()).ratio() > threshold

list_1 = ["123456", "23456", ...]    # len(list_1) ~ 100k
list_2 = ["123123", "asda2131", ...] # len(list_2) ~ 500k

similar = []
for element_list1 in list_1:
    for element_list2 in list_2:
        if are_similar(element_list1, element_list2):
            similar.append((element_list1, element_list2))
print(similar)
What is the best way to parallelize the above code? My current implementation, which is not included here, uses multiprocessing.Pool over the first loop.
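For reference, a minimal sketch of that kind of outer-loop parallelization (the helper matches_for is a name invented for this snippet, and the lists are truncated stand-ins for the real data):

import difflib
from functools import partial
from multiprocessing import Pool

def are_similar(first, second):
    threshold = 0.88
    return difflib.SequenceMatcher(a=first.lower(), b=second.lower()).ratio() > threshold

def matches_for(word, candidates):
    # All words in `candidates` that are similar to `word`.
    return [(word, other) for other in candidates if are_similar(word, other)]

if __name__ == "__main__":
    list_1 = ["123456", "23456"]     # stand-in for the real ~100k-entry list
    list_2 = ["123123", "asda2131"]  # stand-in for the real ~500k-entry list
    with Pool() as pool:
        # Each worker handles one word from list_1 against all of list_2.
        chunks = pool.map(partial(matches_for, candidates=list_2), list_1)
    similar = [pair for chunk in chunks for pair in chunk]
    print(similar)

Note that this only spreads the O(n*m) work across cores; it does not reduce the complexity itself.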
Upvotes: 2
Views: 325
Reputation: 7416
I can suggest another approach, though I am not sure it is exactly what you want.
First, note that comparing a word with itself always gives a similarity of 1, i.e. an exact match, so duplicate words never need to be compared against each other. We can therefore merge both lists into a single list of unique words and only compare each word with the next one.
list_1 = ["123456","23456", ...] # len(list_1) ~ 100k
list_2 =["123123","asda2131", ...] # len(list_2)~ 500k
list_3 = list_1 + list_2
list_3 = list(set(list_3)) # this will merge all same words to a list of unique words.
similar = []
for i in range(0, len(list_3)):
if are_similar(list_3[i], list_3[i+1]):
similar.append((list_3[i],list_3[i+1]))
print (similar)
I build the set of the combined lists because otherwise we may compare exactly the same words again and again; this significantly reduces the number of comparisons when there are many repeated words. Since each word is only compared with its immediate neighbour, the complexity of this method is O(n). I hope this helps.
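One refinement, a sketch under the assumption that the words you want paired tend to be lexicographically close: sort the merged list before the loop, so that similar strings end up next to each other and the adjacent-pair pass is deterministic (a plain set has arbitrary iteration order).

list_3 = sorted(set(list_1 + list_2))  # sort so that similar strings become neighbours

The rest of the loop stays the same.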
Upvotes: 1