user53526356

Reputation: 968

How to Find Company Names in Text Using Python

I have a list of properly formatted company names, and I am trying to find where those companies appear in a document. The problem is that they are unlikely to appear in the document exactly as they do in the list. For example, Visa Inc may appear as Visa, and American Airlines Group Inc may appear as American Airlines.

How would I go about iterating over the entire contents of the document and returning the properly formatted company name when a close match is found?

I have tried both fuzzywuzzy and difflib.get_close_matches, but the problem is that my code compares each individual word against the list rather than clusters of words:

from difflib import get_close_matches

from fuzzywuzzy import process

# Canonical company names to match against
company_names = ['American Tower Inc', 'American Airlines Group Inc', 'Atlantic American Corp', 'American International Group']

text = 'American Tower is one company. American Airlines is another while there is also Atlantic American Corp but we cannot forget about American International Group Inc.'

# Using fuzzywuzzy: extractOne returns a (best_match, score) tuple for each word
for word in text.split():
    match, score = process.extractOne(word, company_names)
    print(f'- {word}: {match}, {score}')

# Using difflib: get_close_matches returns a list holding at most n matches
for word in text.split():
    match = get_close_matches(word, company_names, n=1, cutoff=0.4)
    print(match)

Upvotes: 3

Views: 9353

Answers (2)

Shogun187

Reputation: 88

For that type of task I use a record linkage algorithm, which finds those clusters for you with the help of machine learning. You will have to label some actual examples so that the algorithm can learn to label the rest of your dataset properly.

Here is some info: https://pypi.org/project/pandas-dedupe/

Cheers,

Upvotes: 0

Viseshini Reddy

Reputation: 819

I was working on a similar problem. Unless python-Levenshtein is installed, fuzzywuzzy falls back on difflib's SequenceMatcher internally, and both libraries are slow on large datasets.

Chris van den Berg's pipeline converts company names into TF-IDF vectors of character 3-grams and then compares the vectors using cosine similarity.

The pipeline is quick and gives accurate results for partially matched strings too.
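
To make the idea concrete, here is a minimal standard-library sketch of TF-IDF weighted character 3-grams compared by cosine similarity. It is not the pipeline's actual code; the names trigrams, TrigramTfidfMatcher and best_match are hypothetical, and treating unseen 3-grams as maximally rare is my own assumption so that unrelated phrases score low.

```python
import math
import re
from collections import Counter


def trigrams(s):
    """Return the character 3-grams of a lowercased, space-padded string."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()
    padded = f" {cleaned} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class TrigramTfidfMatcher:
    """Hypothetical helper: match a phrase to the closest canonical name."""

    def __init__(self, names):
        self.names = list(names)
        docs = [Counter(trigrams(name)) for name in self.names]
        n_docs = len(docs)
        # Document frequency: in how many names each 3-gram occurs.
        doc_freq = Counter(gram for doc in docs for gram in doc)
        # Smoothed inverse document frequency: rarer 3-grams weigh more.
        self.idf = {
            gram: math.log((1 + n_docs) / (1 + freq)) + 1
            for gram, freq in doc_freq.items()
        }
        # Assumption: a 3-gram never seen in any name counts as maximally rare.
        self.unseen_idf = math.log(1 + n_docs) + 1
        self.vectors = [self._vector(doc) for doc in docs]

    def _vector(self, counts):
        """Build an L2-normalised TF-IDF vector as a {3-gram: weight} dict."""
        vec = {
            gram: count * self.idf.get(gram, self.unseen_idf)
            for gram, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {gram: w / norm for gram, w in vec.items()} if norm else {}

    def best_match(self, phrase):
        """Return (canonical_name, cosine_similarity) for the closest name."""
        query = self._vector(Counter(trigrams(phrase)))
        # Both vectors are unit length, so the dot product is the cosine.
        scores = [
            sum(w * vec.get(gram, 0.0) for gram, w in query.items())
            for vec in self.vectors
        ]
        best = max(range(len(scores)), key=scores.__getitem__)
        return self.names[best], scores[best]


company_names = [
    "American Tower Inc",
    "American Airlines Group Inc",
    "Atlantic American Corp",
    "American International Group",
]
matcher = TrigramTfidfMatcher(company_names)

for phrase in ["American Tower", "American Airlines", "Atlantic American Corp"]:
    name, score = matcher.best_match(phrase)
    print(f"{phrase!r} -> {name!r} (cosine similarity {score:.2f})")
```

To scan a whole document you would still need to generate candidate phrases (for example, sliding windows of a few words) and keep only matches above a similarity threshold that you tune on your own data. For large name lists, sparse matrices from a library such as scikit-learn will be much faster than these plain dicts.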

Upvotes: 3
