Jstuff

Reputation: 1342

Differentiate a list between human names and company names

I have a list of companies, but some of these companies are simply names of people. I want to eliminate these people from the list, but I am having trouble finding a way to identify the names of people from the companies.

Through online research I have tried two approaches. The first uses NLTK. My code looks like this:

y = ['INOVATIA LABORATORIES LLC', 'PRULLAGE PHD JOSEPH B', 'S J SMITH CO INC', 'TEVA PHARMACEUTICALS USA INC', 'KENT NUTRITION GROUP INC', 'JOSEPH D WAGENKNECHT', 'ROBERTSON KEITH', 'LINCARE INC', 'AGCHOICE - BLUE MOUND']

In the above list I would want to remove PRULLAGE PHD JOSEPH B, JOSEPH D WAGENKNECHT, and ROBERTSON KEITH.

import nltk

z = []
for company in y:
    tokens = nltk.tokenize.word_tokenize(company)
    z.append(nltk.pos_tag(tokens))

This does not work because it tags everything as a proper noun. I then lowercased everything and capitalized only the first letter of each word using .title(), but this also failed for similar reasons.

The other method I tried was using the Human Name Parser module, but this also did not work because it tags the company names as the first and last name of the person.

Is there a way that I can differentiate the above list between human names and company names?

Upvotes: 3

Views: 3127

Answers (3)

wnnmaw

Reputation: 5524

I don't believe you can do this entirely programmatically, so some manual work will be needed. However, you can make things a little easier with itertools.groupby.

As pointed out in some comments, companies are likely to contain certain keywords, so we can create a list of these to use:

key_words = ["INC", "LLC", "CO", "GROUP"]

From here, we can sort the list by whether or not an item contains one of those key words (this is necessary to group):

y.sort(key=lambda name: any(key_word in name for key_word in key_words))    

In your example, this will list

['PRULLAGE PHD JOSEPH B', 'JOSEPH D WAGENKNECHT', 'ROBERTSON KEITH', 'AGCHOICE - BLUE MOUND', 'INOVATIA LABORATORIES LLC', 'S J SMITH CO INC', 'TEVA PHARMACEUTICALS USA INC', 'KENT NUTRITION GROUP INC', 'LINCARE INC']

From here, we can group the items into things that are probably not companies (those which don't contain any key words) and things which are almost certainly companies (those that do contain key words):

import itertools
I = itertools.groupby(y, lambda name: any(key_word in name for key_word in key_words))

So we now have two groups:

for i in I:
    print(i[0], list(i[1]))
False ['PRULLAGE PHD JOSEPH B', 'JOSEPH D WAGENKNECHT', 'ROBERTSON KEITH', 'AGCHOICE - BLUE MOUND']
True ['INOVATIA LABORATORIES LLC', 'S J SMITH CO INC', 'TEVA PHARMACEUTICALS USA INC', 'KENT NUTRITION GROUP INC', 'LINCARE INC']

You can then manually sort through the false group and remove companies, or apply another similar filter method to further improve the matching. Some other filters to apply:

  • Anything which contains "MR", "MS", "MRS", "PHD", "DR" is pretty likely to be a person
  • Strings of the form "multiple_letters<space>single_letter<space>multiple_letters" are probably names; you can match this pattern with re
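The two extra filters above could be sketched roughly as follows. The regular expressions and the helper name looks_like_person are illustrative assumptions, not a tested rule set, and the input is simply the False group from the example.

```python
import re

# Titles taken from the first bullet above; \b keeps "DR" from matching inside "ANDREW".
TITLE_RE = re.compile(r"\b(?:MR|MS|MRS|PHD|DR)\b")
# Second bullet: several letters, one single letter, several letters.
MIDDLE_INITIAL_RE = re.compile(r"^[A-Z]{2,} [A-Z] [A-Z]{2,}$")

def looks_like_person(name):
    """Hypothetical helper: True if either person heuristic matches."""
    return bool(TITLE_RE.search(name) or MIDDLE_INITIAL_RE.match(name))

maybe_people = ['PRULLAGE PHD JOSEPH B', 'JOSEPH D WAGENKNECHT',
                'ROBERTSON KEITH', 'AGCHOICE - BLUE MOUND']
flagged = [name for name in maybe_people if looks_like_person(name)]
print(flagged)
```

With this sketch, ROBERTSON KEITH is not flagged by either rule, so it would still need manual review.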

Upvotes: 2

Muhammad Yaseen Khan

Reputation: 809

As far as I understand, you need to differentiate between company names and human names. I assume the companies in your list either end with LLC or INC or contain a - (hyphen), so I put these into a set, company_set = {'LLC', 'INC', '-'}, and split each name into tokens with the built-in split() method. If company_set and the set of tokens have anything in common, their intersection is not empty, so the company message is printed; otherwise the human message is printed. Below is the code:

y = ['INOVATIA LABORATORIES LLC', 'PRULLAGE PHD JOSEPH B', 'S J SMITH CO INC', 'TEVA PHARMACEUTICALS USA INC', 'KENT NUTRITION GROUP INC', 'JOSEPH D WAGENKNECHT', 'ROBERTSON KEITH', 'LINCARE INC', 'AGCHOICE - BLUE MOUND']
company_set = {'LLC', 'INC', '-'}
for item in y:
    tokens = set(item.split())
    if company_set & tokens:  # non-empty intersection: a company keyword is present
        print("{} is a company".format(item))
    else:
        print("{} is a human".format(item))

And it outputs as follows:

INOVATIA LABORATORIES LLC is a company
PRULLAGE PHD JOSEPH B is a human
S J SMITH CO INC is a company
TEVA PHARMACEUTICALS USA INC is a company
KENT NUTRITION GROUP INC is a company
JOSEPH D WAGENKNECHT is a human
ROBERTSON KEITH is a human
LINCARE INC is a company
AGCHOICE - BLUE MOUND is a company
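Since the question asks to remove the people from the list rather than print a label, the same token test could be wrapped in a filter. This is a minimal sketch; the names COMPANY_TOKENS and is_company are introduced here for illustration only.

```python
# Same keyword set as above; extend it as more company indicators turn up.
COMPANY_TOKENS = {'LLC', 'INC', '-'}

def is_company(name, company_tokens=COMPANY_TOKENS):
    """Hypothetical helper: True if any whitespace-separated token is a company keyword."""
    # isdisjoint() returns False as soon as one shared token is found.
    return not company_tokens.isdisjoint(name.split())

y = ['INOVATIA LABORATORIES LLC', 'PRULLAGE PHD JOSEPH B', 'S J SMITH CO INC',
     'TEVA PHARMACEUTICALS USA INC', 'KENT NUTRITION GROUP INC',
     'JOSEPH D WAGENKNECHT', 'ROBERTSON KEITH', 'LINCARE INC',
     'AGCHOICE - BLUE MOUND']

# Keep only the entries classified as companies.
companies = [name for name in y if is_company(name)]
print(companies)
```

Because the test compares whole tokens, a keyword only counts when it stands alone as a word in the name.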

Upvotes: 2

handle

Reputation: 6329

Test the list elements for indicators of company names. For your list, these are INC, LLC, and the hyphen (which could also be part of a person's name). You could also test for typical parts of company names (lab, pharma, solutions, ...). There may be other criteria (syllables, phonetics). Otherwise, you'd need a dictionary of names or companies to test against.

y = ['INOVATIA LABORATORIES LLC', 'PRULLAGE PHD JOSEPH B', 'S J SMITH CO INC', 'TEVA PHARMACEUTICALS USA INC', 'KENT NUTRITION GROUP INC', 'JOSEPH D WAGENKNECHT', 'ROBERTSON KEITH', 'LINCARE INC', 'AGCHOICE - BLUE MOUND']
f = ["INC", "LLC", "-"]
c = []
for n in y:
  # any() stops at the first match, so a name with several indicators is added only once
  if any(t in n for t in f):
    c.append(n)
print("\n".join(c))

gives

INOVATIA LABORATORIES LLC
S J SMITH CO INC
TEVA PHARMACEUTICALS USA INC
KENT NUTRITION GROUP INC
LINCARE INC
AGCHOICE - BLUE MOUND
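The dictionary idea mentioned above might look something like the sketch below. FIRST_NAMES is a tiny hypothetical sample chosen to fit this list; a real lookup would need a proper names dataset. The sketch also compares whole tokens instead of substrings, because a substring test such as "INC" in "VINCENT PRICE" is True.

```python
# Hypothetical, deliberately tiny sample; not a real names dictionary.
FIRST_NAMES = {"JOSEPH", "KEITH"}
COMPANY_TOKENS = {"INC", "LLC", "-"}

def classify(name):
    """Hypothetical helper: label a name as 'company', 'person', or 'unknown'."""
    tokens = set(name.split())
    if tokens & COMPANY_TOKENS:   # a company indicator takes priority
        return "company"
    if tokens & FIRST_NAMES:      # otherwise look for a known first name
        return "person"
    return "unknown"              # left for manual review

y = ['INOVATIA LABORATORIES LLC', 'PRULLAGE PHD JOSEPH B', 'S J SMITH CO INC',
     'TEVA PHARMACEUTICALS USA INC', 'KENT NUTRITION GROUP INC',
     'JOSEPH D WAGENKNECHT', 'ROBERTSON KEITH', 'LINCARE INC',
     'AGCHOICE - BLUE MOUND']

people = [n for n in y if classify(n) == "person"]
print(people)
```

Names that match neither set come back as "unknown", which keeps uncertain entries visible instead of silently sorting them into one group.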

Upvotes: 1
