AbtPst

Reputation: 8018

How to avoid extracting non-proper nouns from capitalized headings in text?

I am trying to extract keywords from a piece of text using nltk and Stanford NLP tools. After I run my code, I can get a list like this

companyA
companyB
companyC
Trend Analysis For companyA

This is all good, but notice the last item. That is actually a heading that appears in the text. Since all the words in a heading are capitalized, my program thinks that they are all proper nouns and thus clubs them together as if they were one big company name.

The good thing is that as long as a company has been mentioned somewhere in the text, my program will pick it up, hence I get individual items like companyA as well. These are coming from the actual piece of text that talks about that company.

Here is what I want to do.

In the list that I get above, is there a way to look at an item and determine if any previous items are a substring of the current one? For example, in this case when I come across

Trend Analysis For companyA

I can check whether I have seen any part of this before. So I can determine that I already have companyA and thus I will ignore Trend Analysis For companyA. I am confident that the text will mention any company enough times for StanfordNER to pick it up. Thus I do not have to rely on headings to get what I need.
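
For illustration, here is the kind of brute-force check I have in mind (just a rough sketch; drop_heading_like is a made-up name):

def drop_heading_like(names):
    # keep a name only if no previously kept name is a substring of it;
    # a heading like "Trend Analysis For companyA" contains an already
    # extracted company name, so it gets dropped
    kept = []
    for name in names:
        if any(prev in name and prev != name for prev in kept):
            continue
        kept.append(name)
    return kept

print(drop_heading_like(
    ["companyA", "companyB", "companyC", "Trend Analysis For companyA"]))
# ['companyA', 'companyB', 'companyC']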

Does that make sense? Is this the correct approach? I am afraid that this will not be very efficient, but I can't think of anything else.

Edit

Here is the code that I use:

import nltk

# split the document into sentences, then tokenize and POS-tag each one
sentences = nltk.sent_tokenize(document)
sentences = [nltk.word_tokenize(sent) for sent in sentences]
sentences = [nltk.pos_tag(sent) for sent in sentences]

After that I simply use the StanfordNERTagger on each sentence:

from itertools import groupby
from nltk.tag import StanfordNERTagger

result = []
stn = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
for s in sentences:
    # the NER tagger expects plain tokens, not (word, POS) tuples
    taggedWords = stn.tag([w for w, pos in s])
    # group consecutive tokens that share the same NER label
    for tag, chunk in groupby(taggedWords, lambda x: x[1]):
        if tag == "ORGANIZATION":
            result.append((tag, " ".join(w for w, t in chunk)))

return result

In this way I can get all the ORGANIZATIONs.

To @Alvas's point about Truecasing, don't you think it's a bit of an overkill here? When I studied the algorithm, it appeared to me that they are trying to come up with the most likely spelling for each word, where the likelihood is based on a corpus. I don't think I need to build a corpus, as I can use a dictionary like WordNet or something like pyenchant to figure out the appropriate spelling. Also, here I already have all the information I need, i.e. I am picking up all the companies mentioned.

There is another problem. Consider the company name

American Eagle Outfitters

Note that American and american are both valid spellings, and similarly for Eagle and eagle. I am afraid that even if I incorporate Truecasing into my algorithm, it will end up lowercasing terms that should not be lowercased.

Again, my problem right now is that I have all the company names extracted, but I am also extracting the headings. The brute-force way would be to perform a substring check on the list of results; I was just wondering whether there is a more efficient way of doing this. Moreover, I don't think that any tweaking I do will improve the tagging itself. I don't think I will be able to outperform the StanfordNERTagger.

Upvotes: 0

Views: 881

Answers (1)

eldams

Reputation: 750

I encountered a similar problem, but there the whole text was uncapitalized (ASR output). In that case I retrained the NER model on uncapitalized annotated data and obtained better performance.

Here are a few options I would consider, by order of preference (and complexity):

  • Uncapitalize text: After tokenization / sentence splitting, try to guess which sentences are all capitalized, and use a dictionary-based approach to uncapitalize unambiguous tokens (this may be viewed as a sequence labelling problem and could involve machine learning, but data for it can easily be generated); see the sketch after this list.
  • Learn a model with capitalization features: You may add a capitalization feature to the machine learning setup and rebuild both the POS and NER models, but this would require corpora to retrain the models on.
  • Postprocess data: Given that the tagging is error-prone, you may apply some postprocessing that takes the previously discovered entities into account via substring matching. But this process would be hazardous: if you find "America" and "Bank Of America", the correction will probably not be able to tell that "Bank Of America" is actually an entity.
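
Just to illustrate the dictionary-based idea from the first option (a minimal sketch only; COMMON, PROPER and uncapitalize_unambiguous are names I made up here, and a real word list would need more care):

from nltk.corpus import words  # may require nltk.download('words')

wordlist = words.words()
COMMON = set(w for w in wordlist if w.islower())     # listed in lowercase
PROPER = set(w for w in wordlist if w[0].isupper())  # listed capitalized

def uncapitalize_unambiguous(tokens):
    # lowercase a capitalized token only if its lowercase form is a known
    # common word and it is never listed capitalized (so likely not a
    # proper noun); everything else is left untouched
    fixed = []
    for tok in tokens:
        if tok.istitle() and tok.lower() in COMMON and tok not in PROPER:
            fixed.append(tok.lower())
        else:
            fixed.append(tok)
    return fixed

print(uncapitalize_unambiguous(
    ["Trend", "Analysis", "For", "American", "Eagle", "Outfitters"]))
# tokens like "Analysis" and "For" get lowercased; genuinely ambiguous
# ones (e.g. "Eagle") may still be mishandled, which is the concern above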

I would personally consider the first option, as you can easily create an artificial all-capitalized corpus (from correctly capitalized texts) and train a sequence labelling model (e.g. a CRF) to detect where capitalization should be removed; a toy sketch follows.
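
For instance (a toy sketch only, assuming the sklearn-crfsuite package; the feature set and label scheme are just placeholders):

import sklearn_crfsuite

def features(tokens, i):
    # toy feature set for the i-th (artificially capitalized) token
    return {
        'lower': tokens[i].lower(),
        'is_first': i == 0,
        'prev_lower': tokens[i - 1].lower() if i > 0 else '<s>',
    }

def make_example(sentence_tokens):
    # take a correctly capitalized sentence, capitalize every token
    # artificially, and keep the original casing as the label to predict
    capitalized = [t.title() for t in sentence_tokens]
    X = [features(capitalized, i) for i in range(len(capitalized))]
    y = ['KEEP' if t.istitle() or t.isupper() else 'LOWER'
         for t in sentence_tokens]
    return X, y

corpus = [["Trend", "analysis", "for", "American", "Eagle", "Outfitters"],
          ["The", "bank", "reported", "strong", "growth"]]
examples = [make_example(s) for s in corpus]
X_train = [x for x, _ in examples]
y_train = [y for _, y in examples]

crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=50)
crf.fit(X_train, y_train)

heading = ["Trend", "Analysis", "For", "CompanyA"]
print(crf.predict([[features(heading, i) for i in range(len(heading))]]))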

Whatever approach you use, you will indeed never end up with performance as good as on correctly capitalized text. Your input can be considered partly noisy, since a clue is missing for both POS tagging and NER.

Upvotes: 2
