Reputation: 2997
I have a Python implementation of fuzzy matching using the Levenshtein similarity. I'm pretty happy with it but I feel I'm leaving a lot on the table by not considering the structure of the strings.
Here are some examples of matches that are clearly good, but not captured well by Levenshtein:
"The Hobbit" / "Hobbit/The"
"Charlies Angles" / "Charlie's Angels"
"Apples & Pairs" / "Apples and Pairs"
I think some normalization ahead of using Levenshtein would be good, e.g. replace all & with and, remove punctuation, etc. I'm not sure I want to jump straight to stop-word removal and lemmatization, but something along those lines.
To avoid re-inventing the wheel, is there any easy way to do this? Or an alternative to Levenshtein that addresses these issues (short of some BERT embeddings)?
Upvotes: 0
Views: 1172
Reputation: 357
rapidfuzz.utils.default_process
might be an option to consider for preprocessing.
rapidfuzz.utils.default_process(sentence: str) → str
This function preprocesses a string by:
- removing all non-alphanumeric characters
- trimming whitespace
- converting all characters to lower case
Parameters: sentence (str) – string to preprocess
Returns: processed_string (str) – the processed string
https://maxbachmann.github.io/RapidFuzz/Usage/utils.html
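For reference, the three steps listed above can be approximated in plain Python. This is a stdlib-only sketch of what the documentation describes, not the library's actual implementation (rapidfuzz implements it in C++), so edge-case behaviour may differ:

```python
import re

def default_process_sketch(sentence: str) -> str:
    """Sketch of rapidfuzz.utils.default_process:
    replace non-alphanumeric characters with spaces,
    lowercase, and trim leading/trailing whitespace."""
    # Replace every non-alphanumeric character with a space
    cleaned = re.sub(r"[^a-zA-Z0-9]", " ", sentence)
    # Lowercase and trim outer whitespace
    return cleaned.lower().strip()

print(default_process_sketch("Hobbit/The"))  # -> "hobbit the"
```

After this kind of normalization, "The Hobbit" and "Hobbit/The" reduce to the same two tokens, so a token-based scorer (e.g. rapidfuzz's token_sort_ratio) scores them as a near-perfect match.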
Upvotes: 3
Reputation: 1048
Yes, you can use some preprocessing like the function below: remove non-alphanumeric characters, convert everything to lowercase, and collapse extra spaces:
import re

def preprocess_string(s):
    s = s.lower()                       # lowercase everything
    s = s.replace('&', 'and')           # normalize '&' before punctuation is stripped
    s = re.sub(r'[^a-z0-9 ]', '', s)    # drop remaining non-alphanumeric characters
    s = re.sub(r'\s+', ' ', s).strip()  # collapse runs of whitespace and trim
    return s
In fact, preprocessing is almost always crucial for this kind of comparison.
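To illustrate the effect, here is the same normalization fed into a similarity measure. difflib.SequenceMatcher from the standard library is used here only as a stand-in for the question's own Levenshtein implementation, which would slot in the same way:

```python
import re
from difflib import SequenceMatcher

def preprocess_string(s):
    s = s.lower()
    s = s.replace('&', 'and')
    s = re.sub(r'[^a-z0-9 ]', '', s)
    s = re.sub(r'\s+', ' ', s).strip()
    return s

def similarity(a, b):
    # Compare the normalized forms rather than the raw strings
    return SequenceMatcher(None, preprocess_string(a), preprocess_string(b)).ratio()

print(similarity("Apples & Pairs", "Apples and Pairs"))  # -> 1.0
```

After normalization the ampersand pair becomes an exact match, and the apostrophe in "Charlie's Angels" no longer counts against the score.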
Upvotes: 0