Reputation: 937
I want to remove all the proper nouns from a large corpus. Because of the large volume, I take a shortcut and remove all words starting with a capital letter. For the first word of each sentence, though, I also need to check whether it is actually a proper noun. How can I do this without using a tagger? One option is to screen the first words against a list of common proper nouns. Is there a better way, and where can I get such a list? Thanks.
I tried NLTK's pos_tag and Stanford NER. Without context, neither works well.
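To be concrete, the shortcut looks roughly like this (just a sketch; assume the corpus is already split into tokenized sentences):
first_words = []
filtered_sentences = []
for sentence in sentences:  # sentences: list of lists of word tokens
    if not sentence:
        continue
    # keep the sentence-initial word for a separate proper-noun check
    first_words.append(sentence[0])
    # drop every other word that starts with a capital letter
    kept = [sentence[0]] + [w for w in sentence[1:] if not w[:1].isupper()]
    filtered_sentences.append(kept)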
from nltk.tag import StanfordNERTagger

# model and jar are the paths to the trained classifier and stanford-ner.jar
ner_tagger = StanfordNERTagger(model, jar)
names = ner_tagger.tag(first_words)  # first_words: list of sentence-initial words

types = ["DATE", "LOCATION", "ORGANIZATION", "PERSON", "TIME"]
for name, tag in names:
    if tag in types:
        print(name, tag)
Below are some results.
Abnormal ORGANIZATION
Abnormally ORGANIZATION
Abraham ORGANIZATION
Absorption ORGANIZATION
Abundant ORGANIZATION
Abusive ORGANIZATION
Academic ORGANIZATION
Acadia ORGANIZATION
There are too many false positives, since the first letter of a sentence is always capitalized. After I changed the words to all lowercase, the NER tagger even missed common entities such as America and American.
Upvotes: 4
Views: 637
Reputation: 322
You can make a list, from your corpus, of the words that are capitalized when they are not at the start of a sentence. A Bloom filter would be an efficient way to store the results, since you are willing to tolerate false positives.
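Roughly, something like this (a sketch, assuming the corpus is already split into tokenized sentences; a plain Python set stands in for the Bloom filter):
# Words that appear capitalized anywhere other than sentence start are the
# likely proper nouns. A set is used here for simplicity; swap in a Bloom
# filter if the vocabulary is too large to keep in memory.
def capitalized_non_initial_words(sentences):
    seen = set()
    for sentence in sentences:      # sentences: list of lists of word tokens
        for word in sentence[1:]:   # skip the sentence-initial token
            if word[:1].isupper():
                seen.add(word)
    return seen

proper_noun_candidates = capitalized_non_initial_words(sentences)

# A sentence-initial word is then removed only if it also shows up
# capitalized mid-sentence somewhere in the corpus:
# remove_it = first_word in proper_noun_candidates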
Upvotes: 1
Reputation: 1305
If you're just playing, you might dabble with Google's Natural Language API. They provide an "Entity Analysis" that sorts entities into two categories: "proper nouns" (specific people or places) HINT HINT :-) and "common nouns".
I'm only suggesting this as a starting point. There's a threshold under which you can hit the API for free; I think it's around 5,000 "entities" per month?
Disclaimer: I have no commercial affiliation with Google and haven't used the API myself. I've worked on other language parsing projects and thought this sounded interesting.
Upvotes: 0