AbtPst

Reputation: 8018

How to avoid extracting non-proper nouns from capitalized headings in text?

I am trying to extract keywords from a piece of text using nltk and Stanford NLP tools. After I run my code, I can get a list like this

companyA
companyB
companyC
Trend Analysis For companyA

This is all good, but notice the last item. That is actually a heading that appears in the text. Since all the words in a heading are capitalized, my program thinks that they are all proper nouns and thus clubs them together as if they were one big company name.

The good thing is that as long as a company has been mentioned somewhere in the text, my program will pick it up, hence I get individual items like companyA as well. These are coming from the actual piece of text that talks about that company.

Here is what I want to do.

In the list that I get above, is there a way to look at an item and determine if any previous items are a substring of the current one? For example, in this case when I come across

Trend Analysis For companyA

I can check whether I have seen any part of this before. So I can determine that I already have companyA and thus I will ignore Trend Analysis For companyA. I am confident that the text will mention any company enough times for StanfordNER to pick it up. Thus I do not have to rely on headings to get what I need.
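
For illustration, here is the kind of brute-force check I have in mind (just a rough sketch; drop_heading_like is a made-up name):

def drop_heading_like(names):
    # keep a name only if no previously kept name is a substring of it;
    # a heading like "Trend Analysis For companyA" contains an already
    # extracted company name, so it gets dropped
    kept = []
    for name in names:
        if any(prev in name and prev != name for prev in kept):
            continue
        kept.append(name)
    return kept

print(drop_heading_like(
    ["companyA", "companyB", "companyC", "Trend Analysis For companyA"]))
# ['companyA', 'companyB', 'companyC']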

Does that make sense? Is this the correct approach? I am afraid that this will not be very efficient, but I can't think of anything else.

Edit

Here is the code that I use:

import nltk

# split the document into sentences, then tokenize and POS-tag each one
sentences = nltk.sent_tokenize(document)
sentences = [nltk.word_tokenize(sent) for sent in sentences]
sentences = [nltk.pos_tag(sent) for sent in sentences]

After that I simply use the StanfordNERTagger on each sentence:

from itertools import groupby
from nltk.tag import StanfordNERTagger

result = []
stn = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
for s in sentences:
    # the NER tagger expects plain tokens, not (word, POS) tuples
    taggedWords = stn.tag([w for w, pos in s])
    # group consecutive tokens that share the same NER label
    for tag, chunk in groupby(taggedWords, lambda x: x[1]):
        if tag == "ORGANIZATION":
            result.append((tag, " ".join(w for w, t in chunk)))

return result

In this way I can get all the ORGANIZATIONs.

To @Alvas's point about Truecasing, don't you think it's a bit of an overkill here? When I studied the algorithm, it appeared to me that they are trying to come up with the most likely spelling for each word, where the likelihood is based on a corpus. I don't think I need to build a corpus, as I can use a dictionary like WordNet or something like pyenchant to figure out the appropriate spelling. Also, here I already have all the information I need, i.e. I am picking up all the companies mentioned.

There is another problem. Consider the company name

American Eagle Outfitters

Note that American and american are both valid spellings, and similarly for Eagle and eagle. I am afraid that even if I incorporate Truecasing into my algorithm, it will end up lowercasing terms that should not be lowercased.

Again, my problem right now is that I have all the company names extracted, but I am also extracting the headings. The brute-force way would be to perform a substring check on the list of results; I was just wondering whether there is a more efficient way of doing this. Moreover, I don't think that any tweaking I do will improve the tagging itself. I don't think I will be able to outperform the StanfordNERTagger.

Upvotes: 0

Views: 881

Answers (1)

eldams

Reputation: 750

I encountered a similar problem, but there the whole text was uncapitalized (ASR output). In that case I retrained the NER model on uncapitalized annotated data and obtained better performance.

Here are a few options I would consider, by order of preference (and complexity):

  • Uncapitalize text: After tokenization / sentence splitting, try to guess which sentences are all capitalized, and use a dictionary-based approach to uncapitalize unambiguous tokens (this may be viewed as a sequence labelling problem and could involve machine learning, but data for it can easily be generated); see the sketch after this list.
  • Learn a model with capitalization features: You may add a capitalization feature to the machine learning setup and rebuild both the POS and NER models, but this would require corpora to retrain the models on.
  • Postprocess data: Given that the tagging is error-prone, you may apply some postprocessing that takes the previously discovered entities into account via substring matching. But this process would be hazardous: if you find "America" and "Bank Of America", the correction will probably not be able to tell that "Bank Of America" is actually an entity.
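
Just to illustrate the dictionary-based idea from the first option (a minimal sketch only; COMMON, PROPER and uncapitalize_unambiguous are names I made up here, and a real word list would need more care):

from nltk.corpus import words  # may require nltk.download('words')

wordlist = words.words()
COMMON = set(w for w in wordlist if w.islower())     # listed in lowercase
PROPER = set(w for w in wordlist if w[0].isupper())  # listed capitalized

def uncapitalize_unambiguous(tokens):
    # lowercase a capitalized token only if its lowercase form is a known
    # common word and it is never listed capitalized (so likely not a
    # proper noun); everything else is left untouched
    fixed = []
    for tok in tokens:
        if tok.istitle() and tok.lower() in COMMON and tok not in PROPER:
            fixed.append(tok.lower())
        else:
            fixed.append(tok)
    return fixed

print(uncapitalize_unambiguous(
    ["Trend", "Analysis", "For", "American", "Eagle", "Outfitters"]))
# tokens like "Analysis" and "For" get lowercased; genuinely ambiguous
# ones (e.g. "Eagle") may still be mishandled, which is the concern above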

I would personally consider the first option, as you can easily create an artificial all-capitalized corpus (from correctly capitalized texts) and train a sequence labelling model (e.g. a CRF) to detect where capitalization should be removed; a toy sketch follows.
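
For instance (a toy sketch only, assuming the sklearn-crfsuite package; the feature set and label scheme are just placeholders):

import sklearn_crfsuite

def features(tokens, i):
    # toy feature set for the i-th (artificially capitalized) token
    return {
        'lower': tokens[i].lower(),
        'is_first': i == 0,
        'prev_lower': tokens[i - 1].lower() if i > 0 else '<s>',
    }

def make_example(sentence_tokens):
    # take a correctly capitalized sentence, capitalize every token
    # artificially, and keep the original casing as the label to predict
    capitalized = [t.title() for t in sentence_tokens]
    X = [features(capitalized, i) for i in range(len(capitalized))]
    y = ['KEEP' if t.istitle() or t.isupper() else 'LOWER'
         for t in sentence_tokens]
    return X, y

corpus = [["Trend", "analysis", "for", "American", "Eagle", "Outfitters"],
          ["The", "bank", "reported", "strong", "growth"]]
examples = [make_example(s) for s in corpus]
X_train = [x for x, _ in examples]
y_train = [y for _, y in examples]

crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=50)
crf.fit(X_train, y_train)

heading = ["Trend", "Analysis", "For", "CompanyA"]
print(crf.predict([[features(heading, i) for i in range(len(heading))]]))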

Whatever approach you use, you will indeed never end up with performance as good as on correctly capitalized text. Your input can be considered partly noisy, since a clue is missing for both POS tagging and NER.

Upvotes: 2
