Reputation: 410
I have 12 Million company names in my db. I want to match them with a list offline. I want to know the best algorithm to do so. I have done that through Levenstiens distance but it is not giving the expected results. Could you please suggest some algorithms for the same.Problem is matching the companies like
G corp. ----this need to be mapped to G corporation
water Inc -----Water Incorporated
Upvotes: 2
Views: 11219
Reputation: 4467
This library looks promissing. Highly configurable, uses Levenshtein distance, Cosine distance and special suffix handling like Inc., Ltd. etc.
https://pypi.org/project/name-matching/
Upvotes: 1
Reputation: 1
Use MatchKraft to fuzzy match company names on two lists.
Levenstiens distance is not enough to solve this problem. You also need the following:
It is better to use an existing tool rather than creating your program in Python.
Upvotes: -2
Reputation: 2900
You can use fuzzyset, put all your companies names in the fuzzy set and then match a new term to get matching scores. An example :
import fuzzyset
fz = fuzzyset.FuzzySet()
#Create a list of terms we would like to match against in a fuzzy way
for l in ["Diane Abbott", "Boris Johnson"]:
fz.add(l)
#Now see if our sample term fuzzy matches any of those specified terms
sample_term='Boris Johnstone'
fz.get(sample_term), fz.get('Diana Abbot'), fz.get('Joanna Lumley')
Also, if you want to work with semantics, instead of just the string( which works better in such scenarios ), then have a look at spacy similarity. An example from the spacy docs:
import spacy
nlp = spacy.load('en_core_web_md') # make sure to use larger model!
tokens = nlp(u'dog cat banana')
for token1 in tokens:
for token2 in tokens:
print(token1.text, token2.text, token1.similarity(token2))
Upvotes: 2
Reputation: 168824
You should probably start by expanding the known suffixes in both lists (the database and the list). This will take some manual work to figure out the correct mapping, e.g. with regexps:
\s+inc\.?$
-> Incorporated
\s+corp\.?$
-> Corporation
You may want to do other normalization as well, such as lower-casing everything, removing punctuation, etc.
You can then use Levenshtein distance or another fuzzy matching algorithm.
Upvotes: 2