muazfaiz

Reputation: 5021

Matching company names in the news data using Python

I have a news dataset containing almost 10,000 news items from the last 3 years. I also have a list of names of companies listed on the NYSE. I want to check which of the company names in the list appear in the news dataset. Example:

company Name: 'E.I. du Pont de Nemours and Company'
News: 'Monsanto and DuPont settle major disputes with broad patent-licensing deal, with DuPont agreeing to pay at least $1.75 billion over 10 years for rights to technology for herbicide-resistant soybeans.'

I can detect that a news item mentions a company when the exact company name appears in it, but as the example above shows, that is not always the case. I also tried another approach: I took the most distinctive word in the company's full name, e.g. 'Pont' in the example above, which should always be part of the text whenever this company is mentioned. That worked most of the time, but it fails in the following example:

Company Name: Ennis, Inc.
News: L D`ennis` Kozlowski, former chief executive convicted of looting nearly $100 million from Tyco International, has emerged into far more modest life after serving six-and-a-half year sentence and probation; Kozlowski, who became ultimate symbol of corporate greed in era that included scandals at Enron and WorldCom, describes his personal transformation and more humble pleasures that have replaced his once high-flying lifestyle.

Here Ennis matches inside Dennis in the text, so it returns irrelevant news.

Can someone point me to the right way of doing this? Thanks.

Upvotes: 1

Views: 2220

Answers (3)

MiniQuark

Reputation: 48436

It sounds like you need the Aho-Corasick algorithm. There is a nice and fast implementation for Python here: https://pypi.python.org/pypi/pyahocorasick/

It only does exact matching, so you would need to index both "Du pont" and "Dupont", for example. That's not too hard: you can use Wikidata to help you find aliases. For example, the aliases of Dupont's entry include both "Dupont" and "Du pont".

Ok so let's assume you have the list of company names with their aliases:

import ahocorasick
A = ahocorasick.Automaton()

companies = ["google", "apple", "tesla", "dupont", "du pont"]
for idx, key in enumerate(companies):
    A.add_word(key, idx)

Next, make the automaton (see the link above for details on the algorithm):

A.make_automaton()

Great! Now you can simply search for all companies in some text:

your_text = """
I love my Apple iPhone. Do you know what a Googleplex is?
I ate some apples this morning.
"""

for end_index, idx in A.iter(your_text.lower()):
    print(end_index, companies[idx])

This is the output:

15 apple
49 google
74 apple

The numbers correspond to the index of the last character of the company name in the text.

Easy, right? And super fast: this algorithm is used by some variants of GNU grep.
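Note that the automaton reports substring hits, which is why "apples" and "Googleplex" show up in the output above, and why a key like "ennis" would still be found inside "dennis". Here is a minimal sketch of one way to keep only whole-word hits. It assumes the (end_index, keyword) pairs printed above, and the helper name is_whole_word is made up for this example:

```python
def is_whole_word(text, end_index, keyword):
    """Return True if the hit is not glued to a letter or digit on either side."""
    start = end_index - len(keyword) + 1
    before_ok = start == 0 or not text[start - 1].isalnum()
    after_ok = end_index + 1 == len(text) or not text[end_index + 1].isalnum()
    return before_ok and after_ok


your_text = """
I love my Apple iPhone. Do you know what a Googleplex is?
I ate some apples this morning.
"""

# The (end_index, keyword) pairs from the output above, hard-coded here so
# that the sketch runs without the automaton.
raw_matches = [(15, "apple"), (49, "google"), (74, "apple")]

whole_word_matches = [
    (end, keyword)
    for end, keyword in raw_matches
    if is_whole_word(your_text, end, keyword)
]
print(whole_word_matches)  # [(15, 'apple')]
```

In real code you would apply the same check inside the A.iter(...) loop instead of hard-coding the pairs.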

Saving/loading the automaton

If there are a lot of company names, creating the automaton may take some time, so you may want to create it just once, save it to disk (using pickle), then load it every time you need it:

# create_company_automaton.py
# ... create the automaton (see above)
import pickle
with open('company_automaton.pickle', 'wb') as f:
    pickle.dump(A, f)

In the program that will use this automaton, you start by loading the automaton:

# use_company_automaton.py
import ahocorasick
import pickle
with open("company_automaton.pickle", "rb") as f:
    A = pickle.load(f)
# ... use the automaton

Hope this helps! :)

Bonus details

If you want to match "Apple" in "Apple releases a new iPhone" but not in "I ate an apple this morning", you are going to have a hard time. But it is doable: for example, you could gather a set of articles that contain the word "apple" and are about the company, and a set of articles that contain it but are not about the company, then identify words (or n-grams) that are more likely to appear when the article is about the company (e.g. "iPhone"). Unfortunately you would need to do this for every company whose name is ambiguous.
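To make the idea a bit more concrete, here is a rough sketch that scores words by how much more often they occur in the "about the company" set than in the other set. The example sentences are made up for illustration, the smoothed log-ratio is just one simple choice of score, and the names word_counts and company_score are hypothetical:

```python
import math
import re
from collections import Counter


def word_counts(articles):
    """Count lowercase alphabetic tokens over a list of texts."""
    counts = Counter()
    for article in articles:
        counts.update(re.findall(r"[a-z]+", article.lower()))
    return counts


# Made-up example sentences, for illustration only.
about_company = [
    "Apple releases a new iPhone",
    "Apple shares rise after iPhone sales beat forecasts",
]
not_about_company = [
    "I ate an apple this morning",
    "This apple pie recipe uses fresh fruit",
]

company_counts = word_counts(about_company)
other_counts = word_counts(not_about_company)
vocabulary = set(company_counts) | set(other_counts)


def company_score(word):
    """Log-ratio of add-one-smoothed word frequencies (company vs. other)."""
    p_company = (company_counts[word] + 1) / (sum(company_counts.values()) + len(vocabulary))
    p_other = (other_counts[word] + 1) / (sum(other_counts.values()) + len(vocabulary))
    return math.log(p_company / p_other)


# Words with the highest score are the ones most typical of company articles.
ranked = sorted(vocabulary, key=company_score, reverse=True)
print(ranked[:5])
```

With real data you would need far more articles per class, and a positive score for words such as "iphone" would then hint that a given "apple" mention is about the company.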

Upvotes: 1

Padraic Cunningham

Reputation: 180391

Use a regex with word boundaries for exact matches. Whether you match on the full name or on some part of it that you think is unique is up to you, but with word boundaries "Dennis" won't match "Ennis":

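import re
companies = ["name1", "name2",...]
# re.escape keeps characters such as "." in a name from being read as regex syntax
companies_re = re.compile(r"|".join([r"\b{}\b".format(re.escape(name)) for name in companies]))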

Depending on how many matches you expect per news item, you may want to use companies_re.search(article) or companies_re.findall(article). For case-insensitive matching, pass re.I to re.compile.
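As a quick check against the two news items from the question, here is a small sketch. The short names "Ennis" and "DuPont" are assumptions picked for the demo (not the full registered names), and the news strings are shortened:

```python
import re

# Short names chosen for this demo; a real list would come from the NYSE data.
companies = ["Ennis", "DuPont"]
companies_re = re.compile(
    "|".join(r"\b{}\b".format(re.escape(name)) for name in companies),
    re.I,  # case-insensitive matching
)

news_1 = "L Dennis Kozlowski, former chief executive convicted of looting nearly $100 million from Tyco International"
news_2 = "Monsanto and DuPont settle major disputes with broad patent-licensing deal"

print(companies_re.search(news_1))   # None: "Dennis" does not match "Ennis"
print(companies_re.findall(news_2))  # ['DuPont']
```

One caveat: \b only matches between a word character and a non-word character, so a name that ends in punctuation (for example "Inc.") will not match when it is followed by a space. Stripping such suffixes before building the pattern avoids that.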

If the only line you want to check is always the one starting with company Name:, you can narrow down the search:

for line in all_lines:
    if line.startswith("company Name:"):
        name = companies_re.search(line)
        if name:
            ...
        break

Upvotes: 1

Kyriediculous

Reputation: 119

You can try

  difflib.get_close_matches

with the full company name.
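A minimal sketch of what that could look like, assuming you have already pulled a candidate name out of the news text. The alias list and the cutoff of 0.8 are arbitrary choices for illustration:

```python
import difflib

# Hypothetical, lowercased alias list for illustration.
company_names = ["du pont", "ennis", "monsanto"]

# A candidate name taken from a news item.
mention = "DuPont"
matches = difflib.get_close_matches(mention.lower(), company_names, n=1, cutoff=0.8)
print(matches)

# Fuzzy matching is also happy to pair near-identical but unrelated words.
false_positive = difflib.get_close_matches("dennis", company_names, n=1, cutoff=0.8)
print(false_positive)
```

Be aware that this does not fix the Dennis/Ennis problem by itself: the two words differ by one letter, so a similarity cutoff that accepts "dupont" for "du pont" will probably accept "dennis" for "ennis" as well.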

Upvotes: -1
