Guram Keretchashvili
Guram Keretchashvili

Reputation: 47

reduce time-complexity of nested loop in python

here is my code. it takes 17 hours to complete.could you please suggest any alternative code to reduce the time of computation?

# test algorithm1 - fuzzy
matched_pair = []
for x in dataset1['full_name_eng']:
    for y in dataset2['name']:
        if (fuzz.token_sort_ratio(x,y) > 85):
            matched_pair.append((x,y))
            print((x,y))

I tried different ones but did not work ((.

dataset1 - 10krows, dataset2 - 1M rows, fuzz.token_sort_ratio(x,y) - is a function which takes 2 parameters (2strings) and outputs integer - the similarity of these 2 strings

Upvotes: 1

Views: 554

Answers (1)

maxbachmann
maxbachmann

Reputation: 3265

Since the dataframe is not really used here I will simply work with the following two lists:

import string
import random

random.seed(18)
dataset1 = [''.join(random.choice(string.ascii_lowercase + ' ') for _ in range(random.randint(13, 20))) for s in range(1000)]
dataset2 = [''.join(random.choice(string.ascii_lowercase + ' ') for _ in range(random.randint(13, 20))) for s in range(1000)]

using these two lists with the code you provided using fuzzywuzzy. As a first change you could use RapidFuzz (I am the author) which is basically doing the same as FuzzyWuzzy, but is quite a bit faster. When using my tests lists this was about 7 times as fast as your code. Another issue is that when using fuzz.token_sort_ratio the strings are always lowercased and e.g. punctuation is removed. While this makes sence for the string matching, your doing it multiple times for each string in the list, which adds up when working with bigger lists. Using RapidFuzz and preprocessing only once is about 14 times as fast on these lists.

from rapidfuzz import fuzz, utils

dataset2_processed = [utils.default_process(x) for x in dataset2]
dataset1_processed = [utils.default_process(x) for x in dataset1]

matched_pair = []
for word1, word1_processed in zip(dataset1, dataset1_processed):
    for word2, word2_processed in zip(dataset2, dataset2_processed):
        if fuzz.token_sort_ratio(word1_processed, word2_processed, processor=None, score_cutoff=85):
            matched_pair.append((word1, word2))

Upvotes: 1

Related Questions