Pass more than one parameter in FuzzyWuzzy

Question

I have applied fuzzy-wuzzy function on dataset with more than 9000 records as shown here:

def fuzzy(name, column):
    all = []     
    #fuzzy set 
    set = process.extract(name, column, scorer=fuzz.token_set_ratio)
    for set_result in set:
        set_data = {}
        set_data['name'] = set_result[0] 
        set_data['Matching Score'] = set_result[1]
        set_data['Function'] = "set" 
        all.append(set_data)
    return all  

     
#apply similrty  
def Get_all(name): 
    fuzzy_all= []
    fuzzy= fuzzy(name,table.Name)
    fuzzy2= fuzzy(soundex.encode_word(name),table["name_encoded"])
    fuzzy_all=fuzzy+fuzzy2
    return fuzzy_all

Is there a way to improve the function and pass more than one parameter at the same time (column or name), by calling fuzzy one time only so that fuzzy do not have to go to the whole dataset many times?

maxbachmann · Accepted Answer

process.extract

I will start with an explanation of what process.extract can be used for and what all the arguments that can be passed to it mean, since it is often misused which can have a big impact on the performance. In the explanation I will always refer to the library rapidfuzz (I am the author). Rapidfuzz implements the same string matching algorithms and has a very similar interface (there are some differences when they helped to improve the performance).

process.extract in rapidfuzz has the following interface:

extract(
  query,
  choices,
  scorer=,
  processor=,
  limit=5,
  score_cutoff=0)

The function is used to find the best matches for a query in a list of choices

query

query is the string you want to find

choices

choices supports two kinds of inputs that affect what the function will return

a type that has method items. Examples for this are dicts or pandas.Series In this case the function will compare all values of the mapping to the query and return a sorted list of the results in the form [(, , ),...]. So e.g. using a pandas.Series you will receive [(, , ),...].
any type that is iterable like e.g. Lists or Generators In this case the function will compare all values of the Iterable to the query and return a sorted list of the results in the form [(, ),...].

processor

processor is a function that is used to preprocess the query and each choice before they are compared. This can be any function that accepts a string as the argument and returns a string aswell. By default this is rapidfuzz.utils.default_process, which will lower case the string, remove non alphanumeric characters and remove the whitespaces from begin/end of the string.

scorer

scorer is the function that is used to compare the query to each choice and requires the following interface:

def scorer(s1, s2, processor, score_cutoff)

s1 and s2 are the two strings. Processor is generally passed as None by process.extract to deactivate preprocessing, since this was already done inside process.extract. Score_cutoff is the minimum similarity required between the two strings. If this similarity is not reached 0 should be returned instead.

By default this scorer is fuzz.WRatio which is combining multiple different ratios and weights them. The other available scorers can be found here: scorer docs Or you can create your own function with a similar interface to perform the comparision.

score_cutoff

As described above score_cutoff is used to set a minimum similarity for the string matching. Elements below this similarity will not be added to the results. Providing this parameter helps to improve the performance, since it allows rapidfuzz to use faster algorithms to compare strings: When no score_cutoff is provided rapidfuzz will always use the levenshtein distance which has O(N*M) runtime. However when it is passed it will:

compare the lengths of the two strings to find out whether the score_cutoff can even be reached in constant time and return 0 if it can not reach it (score at least as high as from the levenshtein distance)
it will count how many characters appear in one string but not in the other in O(N+M) time to find out whether score_cutoff can be reached and exit early if it can't (score at least as high as from the levenshtein distance)
only if it could not exit early the levenshtein distance is calculated

limit

Using this parameter you can limit the amount of results to the limit best matches in choices. By default this is 5 so you will only get the 5 best results. By setting limit=None you will get all results as long as they did not get filtered out by the score_cutoff

The most time consuming part in your function is the call to process.extract.

extract(query, choices, processor=default_processor, scorer=default_scorer, limit=5):

Possibilities to improve the performance

There are couple of things that might help you to achieve a better performance:

You should probably use rapidfuzz since it is a lot faster. This is because I did implement everything in C++ and a lot of the algorithms of Fuzzywuzzy could be improved a lot.
You can deactivate pre-processing the strings by passing processor=None to process.extract. When you require the pre-processing for your data you can perform it once ahead of time so you do not have to perform it whenever you call Get_all
As described above providing a score_cutoff can help to improve the performance
A last option would be to write a optimised version of process.extract. Right now it calculates the scores for all choices passed. However after finding limit choices with a score above score_cutoff it would be possible to increase score_cutoff to the smallest score in the results. However this would require maintaining a sorted list of the current best results, so it might or might not be worth it. When your only interested in the best result process.extractOne is already using this technique to improve the runtime.