Python string similarity (with complexity)

Question

I have a string I would like to match against a list of candidates. Here is an example:

# ignore case
string = "The Shining" # The Stanley Kubrick Movie
candidates = ['Shining', 'The shins', 'Shining, The'] 
most_similar(string, candidates)
==> 'Shining, The'

For a literal string comparison, I would usually use the Levenshtein distance or ratio. However, I'd like a more sophisticated similarity test, so that the best match in the example above is 'Shining, The'.

I'm guessing this is a common problem that has already been solved many times, so which library or tool would be the best way to do this?

Jay Mody · Accepted Answer

You're looking for the gensim or fuzzywuzzy package.

In this specific case, fuzzywuzzy is probably the better fit, since you are just trying to do a string match.

gensim is more for calculating similarity scores and vector representations of documents, paragraphs, sentences, words, and corpora, with the goal of capturing semantic or topical meaning rather than literal string matching.

So in your case, using fuzzy string matching, you might do:

>>> from fuzzywuzzy import fuzz

>>> fuzz.partial_ratio('Shining', 'The shins')
50

>>> fuzz.partial_ratio('Shining', 'Shining, The')
100

>>> fuzz.partial_ratio('Shining', 'unrelated')
14

The partial_ratio function is case sensitive, so you might want to lowercase all of your inputs. It outputs a score between 0 and 100 (100 being a very strong match). How you filter matches from there is up to you; one option is a threshold, e.g. if score > 75, it's a match.
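Here is a rough sketch of the lowercase-then-threshold idea. The helper names score and matches are made up for illustration, and difflib.SequenceMatcher from the standard library stands in for a fuzzywuzzy scorer so the snippet runs without extra installs. Its scores will not necessarily equal what fuzz.partial_ratio returns.

```python
from difflib import SequenceMatcher


def score(a, b):
    """Return a 0-100 similarity score, ignoring case.

    Assumption: SequenceMatcher is a stand-in for a fuzzywuzzy scorer.
    """
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return round(100 * ratio)


def matches(query, candidates, threshold=75):
    """Keep only the candidates that score above the threshold."""
    return [c for c in candidates if score(query, c) > threshold]


print(matches("The Shining", ["the shining", "THE SHINING", "unrelated"]))
```

The threshold of 75 is only the example value from above; you would tune it against your own data.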

I would recommend looking into the different functions in the fuzzywuzzy package to see what works best for your case.
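For the reordered-title case from the question ('Shining, The'), one of those functions, fuzz.token_sort_ratio, compares the strings after sorting their words. Below is a minimal sketch of the same idea using only the standard library, so it is an approximation and not fuzzywuzzy's actual implementation. The name most_similar comes from the question; normalize is a hypothetical helper.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, drop punctuation, and sort the words alphabetically."""
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(sorted(tokens))


def most_similar(query, candidates):
    """Return the candidate whose normalized form is closest to the query's."""
    target = normalize(query)
    return max(
        candidates,
        key=lambda c: SequenceMatcher(None, target, normalize(c)).ratio(),
    )


string = "The Shining"
candidates = ['Shining', 'The shins', 'Shining, The']
print(most_similar(string, candidates))  # prints: Shining, The
```

With fuzzywuzzy installed, process.extractOne(string, candidates, scorer=fuzz.token_sort_ratio) should give you a comparable lookup, returning the best candidate together with its score.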
