Reputation: 110163
I have a string I would like to match against a list of candidates. Here is an example:
# ignore case
string = "The Shining" # The Stanley Kubrick Movie
candidates = ['Shining', 'The shins', 'Shining, The']
most_similar(string, candidates)
==> 'Shining, The'
For a "literal string comparison" I usually use the Levenshtein distance or ratio. However, I'd like to do a more sophisticated similarity test so that the best match in the above case is 'Shining, The'.
I'm guessing that this is a common problem that has already been solved extensively, so I was wondering what library/tool/etc. would be the best way to do this?
Upvotes: 2
Views: 1678
Reputation: 4033
You're looking for either the gensim or the fuzzywuzzy package.
In this specific case, you're probably leaning towards fuzzywuzzy, since you are just trying to do a string match.
gensim is more for calculating similarity scores and vector representations for documents, paragraphs, sentences, words, corpora, etc., with the goal of capturing semantic/topical meaning rather than literal string matching.
So in your case, using fuzzy string matching, you might do:
from fuzzywuzzy import fuzz
fuzz.partial_ratio('Shining', 'The shins')
>>> 50
fuzz.partial_ratio('Shining', 'Shining, The')
>>> 100
fuzz.partial_ratio('Shining', 'unrelated')
>>> 14
The partial_ratio function is case sensitive, so you might want to lowercase all of your inputs. It outputs a score between 0 and 100 (100 being a very strong match). It's up to you how you filter matches from there; one option is a threshold: if score > 75, it's a match.
I would recommend looking into the different functions in the fuzzywuzzy package to see what works best for your case.
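To make the lowercase-and-threshold idea concrete for the "The Shining" / "Shining, The" case from the question, here is a minimal sketch. It is not fuzzywuzzy itself: it uses the standard library's difflib as a stand-in and sorts the words before comparing, which is the same idea behind fuzzywuzzy's token_sort_ratio, so the exact scores will differ from fuzzywuzzy's. The helper names normalize and similarity and the default threshold of 75 are only illustrative assumptions.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    # Lowercase, drop punctuation, and sort the words so that
    # "Shining, The" and "The Shining" normalize to the same string.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(sorted(tokens))


def similarity(a, b):
    # Score from 0 to 100 based on difflib's ratio
    # (a stand-in here, not fuzzywuzzy's exact scoring).
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return round(100 * ratio)


def most_similar(string, candidates, threshold=75):
    # Return the best-scoring candidate, or None if nothing
    # reaches the (illustrative) threshold.
    best = max(candidates, key=lambda c: similarity(string, c))
    return best if similarity(string, best) >= threshold else None


candidates = ['Shining', 'The shins', 'Shining, The']
print(most_similar("The Shining", candidates))
```

Because word order and punctuation are removed before scoring, 'Shining, The' becomes an exact match for "The Shining", while a candidate with no real overlap falls below the threshold and yields None.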
Upvotes: 4