Reputation: 968
I have a list of properly-formatted company names, and I am trying to find when those companies appear in a document. The problem is that they are unlikely to appear in the document exactly as they do in the list. For example, Visa Inc may appear as Visa, or American Airlines Group Inc may appear as American Airlines.
How would I go about iterating over the entire contents of the document and returning the properly formatted company name whenever a close match is found?
I have tried both fuzzywuzzy and difflib.get_close_matches, but the problem is that my code compares each individual word rather than clusters of words:
from fuzzywuzzy import process
from difflib import get_close_matches

company_name = ['American Tower Inc', 'American Airlines Group Inc', 'Atlantic American Corp', 'American International Group']
text = 'American Tower is one company. American Airlines is another while there is also Atlantic American Corp but we cannot forget about American International Group Inc.'

# using fuzzywuzzy
for word in text.split():
    print('- ' + word + ', ', ', '.join(map(str, process.extractOne(word, company_name))))

# using get_close_matches
for word in text.split():
    match = get_close_matches(word, company_name, n=1, cutoff=.4)
    print(match)
Upvotes: 3
Views: 9353
Reputation: 88
For that type of task I use a record linkage algorithm. It will find those clusters for you with the help of ML, but you will have to provide some actual labeled examples so the algorithm can learn to label the rest of your dataset properly.
Here is some info: https://pypi.org/project/pandas-dedupe/
Cheers,
Upvotes: 0
Reputation: 819
I was working on a similar problem. Fuzzywuzzy internally uses difflib and both of them perform slowly on large datasets.
Chris van den Berg's pipeline converts company names into vectors of 3-grams using a TF-IDF matrix and then compares the vectors using cosine similarity.
The pipeline is quick and gives accurate results for partially matched strings too.
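To make the idea concrete, here is a minimal standard-library sketch of the same technique: character 3-grams, TF-IDF weights, and cosine similarity. It is not Chris van den Berg's code (a real pipeline would more likely use a vectorizer library and sparse matrices), and the helper names, the space padding, the smoothed IDF formula, and the 0.5 threshold are all my own assumptions for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Character n-grams of a lowercased string padded with one space per side."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_idf(names, n=3):
    """Smoothed inverse document frequency of every n-gram seen in the names."""
    doc_freq = Counter()
    for name in names:
        doc_freq.update(set(char_ngrams(name, n)))
    total = len(names)
    return {g: math.log((1 + total) / (1 + df)) + 1 for g, df in doc_freq.items()}


def tfidf_vector(text, idf, n=3):
    """L2-normalised sparse TF-IDF vector; n-grams unseen in the names are dropped."""
    counts = Counter(g for g in char_ngrams(text, n) if g in idf)
    vec = {g: c * idf[g] for g, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(w * b.get(g, 0.0) for g, w in a.items())


def best_match(query, names, name_vectors, idf, threshold=0.5):
    """Return (name, score) for the closest name, or (None, score) below threshold."""
    q = tfidf_vector(query, idf)
    score, name = max((cosine(q, v), nm) for nm, v in zip(names, name_vectors))
    return (name, score) if score >= threshold else (None, score)


company_names = ['American Tower Inc', 'American Airlines Group Inc',
                 'Atlantic American Corp', 'American International Group']
idf = build_idf(company_names)
name_vectors = [tfidf_vector(name, idf) for name in company_names]

# Candidate phrases as they might appear in a document (hypothetical inputs).
for phrase in ['American Airlines', 'American Tower', 'Microsoft']:
    print(phrase, '->', best_match(phrase, company_names, name_vectors, idf))
```

Because 3-grams that are shared by many names (such as those in "American") get a low weight, the distinctive part of each name dominates the score. To scan a whole document you would still need to generate candidate phrases, for example sliding windows of one to four words, and run each through best_match.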
Upvotes: 3