Reputation: 47
here is my code. it takes 17 hours to complete.could you please suggest any alternative code to reduce the time of computation?
# test algorithm1 - fuzzy
matched_pair = []
for x in dataset1['full_name_eng']:
for y in dataset2['name']:
if (fuzz.token_sort_ratio(x,y) > 85):
matched_pair.append((x,y))
print((x,y))
I tried different ones but did not work ((.
dataset1 - 10krows, dataset2 - 1M rows, fuzz.token_sort_ratio(x,y) - is a function which takes 2 parameters (2strings) and outputs integer - the similarity of these 2 strings
Upvotes: 1
Views: 554
Reputation: 3265
Since the dataframe is not really used here I will simply work with the following two lists:
import string
import random
random.seed(18)
dataset1 = [''.join(random.choice(string.ascii_lowercase + ' ') for _ in range(random.randint(13, 20))) for s in range(1000)]
dataset2 = [''.join(random.choice(string.ascii_lowercase + ' ') for _ in range(random.randint(13, 20))) for s in range(1000)]
using these two lists with the code you provided using fuzzywuzzy. As a first change you could use RapidFuzz (I am the author) which is basically doing the same as FuzzyWuzzy, but is quite a bit faster. When using my tests lists this was about 7 times as fast as your code. Another issue is that when using fuzz.token_sort_ratio the strings are always lowercased and e.g. punctuation is removed. While this makes sence for the string matching, your doing it multiple times for each string in the list, which adds up when working with bigger lists. Using RapidFuzz and preprocessing only once is about 14 times as fast on these lists.
from rapidfuzz import fuzz, utils
dataset2_processed = [utils.default_process(x) for x in dataset2]
dataset1_processed = [utils.default_process(x) for x in dataset1]
matched_pair = []
for word1, word1_processed in zip(dataset1, dataset1_processed):
for word2, word2_processed in zip(dataset2, dataset2_processed):
if fuzz.token_sort_ratio(word1_processed, word2_processed, processor=None, score_cutoff=85):
matched_pair.append((word1, word2))
Upvotes: 1