Reputation: 5021
I have a list of companies like the following:
companies = ['Advance Auto Parts Inc', 'AllianceBernstein Holding L.P.', 'AbbVie Inc.', 'Asbury Automotive Group Inc', 'ABM Industries Incorporated']
I also have daily news data.
News = ['news1', 'news2', 'news3']
Now I want to search for these names in the news data, but in News the company names do not appear as complete names like in the list above. I want to do something like this:
    for news in News:
        for company in companies:
            if company in news:
                print('do something')
The best idea that comes to mind right now is to write out every company name the way it is likely to be written in News, but that will take a lot of time because I have thousands of companies. Any suggestions for handling this problem? Thanks.
Upvotes: 1
Views: 996
Reputation: 477
Try identifying the most common endings first, e.g. Inc or Ltd. Then you'll be able to search the news for both the full name, Advance Auto Parts Inc, and the stripped version, Advance Auto Parts. After that, you could check whether the name contains other words like Group, or strings like And Sons, and strip those too.
Each time, run the news search function with the whole name and then with each of the stripped versions.
    def news(company_name):
        # strip() and search_news() are placeholders for your own helpers
        stripped_versions = [company_name]
        stripped_versions += strip(company_name)
        for version in stripped_versions:
            search_news(version)
Here stripped_versions is a list containing the company's full name and its stripped versions, for example: ['Advance Auto Parts Inc', 'Advance Auto Parts']
I hope this pseudo-code helps you approach your problem.
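For anyone who wants something runnable, here is a minimal Python sketch of the same idea. The SUFFIXES list and the function names name_variants and find_mentions are my own assumptions for illustration, not an exhaustive or standard list, so extend them for your data. It strips one trailing suffix at a time and matches each variant as a whole phrase, ignoring case.

```python
import re

# Assumed list of trailing words to strip; extend it for your own data.
SUFFIXES = ["Incorporated", "Inc", "Ltd", "L.P.", "Group", "Holding"]


def name_variants(company_name):
    """Return the full name followed by progressively stripped versions."""
    variants = [company_name]
    current = company_name
    while True:
        for suffix in SUFFIXES:
            # Suffix at the end of the name, preceded by spaces or a comma,
            # optionally followed by a period.
            pattern = r"[\s,]+" + re.escape(suffix) + r"\.?$"
            shortened = re.sub(pattern, "", current, flags=re.IGNORECASE)
            if shortened != current:
                break
        else:
            # No suffix matched, so there is nothing left to strip.
            return variants
        current = shortened
        variants.append(current)


def find_mentions(news_items, companies):
    """Map each news item to the companies whose name variants appear in it."""
    mentions = {}
    for news in news_items:
        found = []
        for company in companies:
            for variant in name_variants(company):
                # Lookarounds keep the match from starting or ending mid-word.
                pattern = r"(?<!\w)" + re.escape(variant) + r"(?!\w)"
                if re.search(pattern, news, flags=re.IGNORECASE):
                    found.append(company)
                    break
        mentions[news] = found
    return mentions
```

Calling name_variants('Asbury Automotive Group Inc') gives the full name, then 'Asbury Automotive Group', then 'Asbury Automotive'. Keep in mind that very short stripped names can produce false matches, so you may want to skip variants below some minimum length.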
Upvotes: 2
Reputation: 19
I would suggest picking up company name lists from the internet and rebuilding your list from them. Refining your Google searches with Google dorks might help.
For example, try putting this in the search bar:
list of fortune 500 companies ext:xls
The dork above should turn up some xls files containing such a list. I think that will still require some manual work, but it should be easier.
Upvotes: 0