Reputation: 937
I want to remove all the proper nouns from a large corpus. Because of the large volume, I take a shortcut and remove all words starting with a capital letter. For the first word of each sentence, though, I also need to check whether it is actually a proper noun. How can I do this without using a tagger? One option is to screen the first words against a list of common proper nouns. Is there a better way, and where can I get such a list? Thanks.
I tried NLTK's pos_tag and Stanford NER. Without context, neither works well.
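To be concrete, the shortcut looks roughly like this (just a sketch; assume the corpus is already split into tokenized sentences):
first_words = []
filtered_sentences = []
for sentence in sentences:  # sentences: list of lists of word tokens
    if not sentence:
        continue
    # keep the sentence-initial word for a separate proper-noun check
    first_words.append(sentence[0])
    # drop every other word that starts with a capital letter
    kept = [sentence[0]] + [w for w in sentence[1:] if not w[:1].isupper()]
    filtered_sentences.append(kept)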
from nltk.tag import StanfordNERTagger

# model and jar are the paths to the trained classifier and stanford-ner.jar
ner_tagger = StanfordNERTagger(model, jar)
names = ner_tagger.tag(first_words)  # first_words: list of sentence-initial words

types = ["DATE", "LOCATION", "ORGANIZATION", "PERSON", "TIME"]
for name, tag in names:
    if tag in types:
        print(name, tag)
Below are some results.
Abnormal ORGANIZATION
Abnormally ORGANIZATION
Abraham ORGANIZATION
Absorption ORGANIZATION
Abundant ORGANIZATION
Abusive ORGANIZATION
Academic ORGANIZATION
Acadia ORGANIZATION
There are too many false positives, since the first letter of a sentence is always capitalized. After I changed the words to all lowercase, the NER tagger even missed common entities such as America and American.
Upvotes: 4
Views: 637
Reputation: 322
You can make a list, from your corpus, of the words that are capitalized when they are not at the start of a sentence. A Bloom filter would be an efficient way to store the results, since you are willing to tolerate false positives.
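Roughly, something like this (a sketch, assuming the corpus is already split into tokenized sentences; a plain Python set stands in for the Bloom filter):
# Words that appear capitalized anywhere other than sentence start are the
# likely proper nouns. A set is used here for simplicity; swap in a Bloom
# filter if the vocabulary is too large to keep in memory.
def capitalized_non_initial_words(sentences):
    seen = set()
    for sentence in sentences:      # sentences: list of lists of word tokens
        for word in sentence[1:]:   # skip the sentence-initial token
            if word[:1].isupper():
                seen.add(word)
    return seen

proper_noun_candidates = capitalized_non_initial_words(sentences)

# A sentence-initial word is then removed only if it also shows up
# capitalized mid-sentence somewhere in the corpus:
# remove_it = first_word in proper_noun_candidates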
Upvotes: 1
Reputation: 1305
If you're just playing, you might dabble with Google's Natural Language API. They provide an "Entity Analysis" that sorts entities into two categories: "proper nouns" (specific people or places) HINT HINT :-) and "common nouns".
I'm only suggesting this as a starting point. There's a threshold under which you can hit the API for free; I think it's around 5,000 "entities" per month?
Disclaimer: I have no commercial affiliation with Google and haven't used the API myself. I've worked on other language parsing projects and thought this sounded interesting.
Upvotes: 0