Reputation: 495
I have applied fuzzy-wuzzy function on dataset with more than 9000 records as shown here:
def fuzzy(name, column):
all = []
#fuzzy set
set = process.extract(name, column, scorer=fuzz.token_set_ratio)
for set_result in set:
set_data = {}
set_data['name'] = set_result[0]
set_data['Matching Score'] = set_result[1]
set_data['Function'] = "set"
all.append(set_data)
return all
#apply similrty
def Get_all(name):
fuzzy_all= []
fuzzy= fuzzy(name,table.Name)
fuzzy2= fuzzy(soundex.encode_word(name),table["name_encoded"])
fuzzy_all=fuzzy+fuzzy2
return fuzzy_all
Is there a way to improve the function and pass more than one parameter at the same time (column
or name
), by calling fuzzy one time only so that fuzzy do not have to go to the whole dataset many times?
Upvotes: 2
Views: 2509
Reputation: 3265
I will start with an explanation of what process.extract
can be used for and what all the arguments that can be passed to it mean, since it is often misused which can have a big impact on the performance. In the explanation I will always refer to the library rapidfuzz (I am the author). Rapidfuzz implements the same string matching algorithms and has a very similar interface (there are some differences when they helped to improve the performance).
process.extract
in rapidfuzz has the following interface:
extract(
query,
choices,
scorer=<built-in function WRatio>,
processor=<built-in function default_process>,
limit=5,
score_cutoff=0)
The function is used to find the best matches for a query in a list of choices
query is the string you want to find
choices supports two kinds of inputs that affect what the function will return
a type that has method items
. Examples for this are dicts
or pandas.Series
In this case the function will compare all values of the mapping to the query and return a sorted list of the results in the form [(<value>, <score>, <key>),...].
So e.g. using a pandas.Series
you will receive [(<value>, <score>, <index>),...].
any type that is iterable like e.g. Lists or Generators In this case the function will compare all values of the Iterable to the query and return a sorted list of the results in the form [(<value>, <score>),...].
processor is a function that is used to preprocess the query and each choice before they are compared. This can be any function that accepts a string as the argument and returns a string aswell.
By default this is rapidfuzz.utils.default_process
, which will lower case the string, remove non alphanumeric characters and remove the whitespaces from begin/end of the string.
scorer is the function that is used to compare the query to each choice and requires the following interface:
def scorer(s1, s2, processor, score_cutoff)
s1 and s2 are the two strings. Processor is generally passed as None by process.extract
to deactivate preprocessing, since this was already done inside process.extract
. Score_cutoff is the minimum similarity required between the two strings. If this similarity is not reached 0 should be returned instead.
By default this scorer is fuzz.WRatio
which is combining multiple different ratios and weights them. The other available scorers can be found here: scorer docs
Or you can create your own function with a similar interface to perform the comparision.
As described above score_cutoff is used to set a minimum similarity for the string matching. Elements below this similarity will not be added to the results. Providing this parameter helps to improve the performance, since it allows rapidfuzz to use faster algorithms to compare strings: When no score_cutoff is provided rapidfuzz will always use the levenshtein distance which has O(N*M) runtime. However when it is passed it will:
Using this parameter you can limit the amount of results to the limit
best matches in choices
. By default this is 5 so you will only get the 5 best results. By setting limit=None
you will get all results as long as they did not get filtered out by the score_cutoff
The most time consuming part in your function is the call to process.extract
.
extract(query, choices, processor=default_processor, scorer=default_scorer, limit=5):
There are couple of things that might help you to achieve a better performance:
You should probably use rapidfuzz since it is a lot faster. This is because I did implement everything in C++ and a lot of the algorithms of Fuzzywuzzy could be improved a lot.
You can deactivate pre-processing the strings by passing processor=None
to process.extract
. When you require the pre-processing for your data you can perform it once ahead of time so you do not have to perform it whenever you call Get_all
As described above providing a score_cutoff
can help to improve the performance
A last option would be to write a optimised version of process.extract
. Right now it calculates the scores for all choices passed. However after finding limit
choices with a score above score_cutoff
it would be possible to increase score_cutoff
to the smallest score in the results. However this would require maintaining a sorted list of the current best results, so it might or might not be worth it. When your only interested in the best result process.extractOne
is already using this technique to improve the runtime.
Upvotes: 6