shashank

Reputation: 410

Match Names of the Companies approximately

I have 12 million company names in my database and want to match them offline against another list. I would like to know the best algorithm for this. I tried Levenshtein distance, but it does not give the expected results. Could you please suggest some algorithms? The problem is matching companies such as:

G corp. ----> should be mapped to G corporation
water Inc ----> Water Incorporated

Upvotes: 2

Views: 11219

Answers (4)

Michel Samia

Reputation: 4467

This library looks promising. It is highly configurable, uses Levenshtein distance and cosine distance, and has special handling for suffixes such as Inc., Ltd., etc.

https://pypi.org/project/name-matching/

Upvotes: 1

user3643160

Reputation: 1

Use MatchKraft to fuzzy match company names on two lists.

http://www.matchkraft.com/

Levenshtein distance alone is not enough to solve this problem. You also need the following:

  1. Heuristics to improve execution time
  2. Information retrieval (Lucene) and SQL
  3. Company names database

It is better to use an existing tool rather than creating your program in Python.

Upvotes: -2

Deepak Saini

Reputation: 2900

You can use fuzzyset: add all your company names to a FuzzySet, then look up a new term to get the candidate matches and their scores. An example:

import fuzzyset

fz = fuzzyset.FuzzySet()
# Add the terms we would like to match against in a fuzzy way
for name in ["Diane Abbott", "Boris Johnson"]:
    fz.add(name)

# Now see whether our sample terms fuzzy-match any of the stored terms
sample_term = 'Boris Johnstone'
print(fz.get(sample_term), fz.get('Diana Abbot'), fz.get('Joanna Lumley'))

Also, if you want to work with semantics instead of just the raw string (which works better in such scenarios), have a look at spaCy similarity. An example from the spaCy docs:

import spacy

nlp = spacy.load('en_core_web_md')  # make sure to use larger model!
tokens = nlp(u'dog cat banana')

for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

Upvotes: 2

AKX

Reputation: 168824

You should probably start by expanding the known suffixes in both lists (the database and the list). This will take some manual work to figure out the correct mapping, e.g. with regexps:

  • \s+inc\.?$ -> Incorporated
  • \s+corp\.?$ -> Corporation

You may want to do other normalization as well, such as lower-casing everything, removing punctuation, etc.
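As a minimal sketch of that normalization step, assuming only the two suffix rules listed above (the SUFFIXES table and the normalize helper are hypothetical names for illustration, and a real mapping would need more entries worked out from your data):

```python
import re

# Hypothetical suffix table built from the two rules above; extend it
# after inspecting the suffixes that actually occur in your data.
SUFFIXES = [
    (re.compile(r"\s+inc\.?$", re.IGNORECASE), " incorporated"),
    (re.compile(r"\s+corp\.?$", re.IGNORECASE), " corporation"),
]


def normalize(name: str) -> str:
    """Lower-case, expand known suffixes, drop punctuation, collapse spaces."""
    name = name.strip().lower()
    for pattern, replacement in SUFFIXES:
        name = pattern.sub(replacement, name)
    name = re.sub(r"[^\w\s]", "", name)       # remove punctuation
    return re.sub(r"\s+", " ", name).strip()  # collapse repeated whitespace


print(normalize("G corp."))    # g corporation
print(normalize("water Inc"))  # water incorporated
```

Applying the same function to both lists means "G corp." and "G corporation" end up as identical strings, so many pairs can be joined exactly before any fuzzy matching is needed.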

You can then use Levenshtein distance or another fuzzy matching algorithm.
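For illustration, here is a plain-Python sketch of Levenshtein distance with a naive best-match lookup. The best_match helper and the sample names are assumptions for the example. With 12 million names you would likely want an optimized library and some way of narrowing the candidates first, since this version compares the query against every candidate.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def best_match(query, candidates):
    """Return the (candidate, distance) pair with the smallest distance."""
    return min(((c, levenshtein(query, c)) for c in candidates),
               key=lambda pair: pair[1])


# Hypothetical, already-normalized names
candidates = ["g corporation", "water incorporated", "h corporation"]
print(best_match("g corporatoin", candidates))
```

Note that the raw strings "G corp." and "G corporation" are seven edits apart, which is why expanding the suffixes before computing distances matters.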

Upvotes: 2
