How to find uncapitalised proper nouns with NLTK?

Question

I'm writing a program to fix faulty capitalisation, and I'm trying to find proper nouns in Python using NLTK's POS tagger. The problem is that the tagger performs poorly on text with faulty or missing capitalisation.

This is the code I have so far:

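# Assumes the NLTK tokenizer and tagger data have already been downloaded.
import nltk

# Sample text with deliberately faulty capitalisation
text = "This is My text. Unicorns are very Nice, I think. how do you do? are you okay! testing capitalisation. my nice Friend is called bob he lives in america."

# Split the text into tokens, then tag each token with its part of speech
tokenized_words = nltk.word_tokenize(text)
pos_tagged_text = nltk.pos_tag(tokenized_words)
print(pos_tagged_text)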

And the output is:

[('This', 'DT'), ('is', 'VBZ'), ('My', 'PRP$'), ('text', 'NN'), ('.', '.'), ('Unicorns', 'NNS'), ('are', 'VBP'), ('very', 'RB'), ('Nice', 'NNP'), (',', ','), ('I', 'PRP'), ('think', 'VBP'), ('.', '.'), ('how', 'WRB'), ('do', 'VB'), ('you', 'PRP'), ('do', 'VB'), ('?', '.'), ('are', 'VBP'), ('you', 'PRP'), ('okay', 'JJ'), ('!', '.'), ('testing', 'VBG'), ('capitalisation', 'NN'), ('.', '.'), ('my', 'PRP$'), ('nice', 'JJ'), ('Friend', 'NNP'), ('is', 'VBZ'), ('called', 'VBN'), ('bob', 'NN'), ('he', 'PRP'), ('lives', 'VBZ'), ('in', 'IN'), ('america', 'NN'), ('.', '.')]

As you can see, there are quite a few mistakes: "Nice" and "Friend" get tagged as proper nouns (NNP), while "bob" and "america" get tagged as common nouns (NN).
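
To make the mismatch concrete, here is a minimal sketch that filters the tagged output for the Penn Treebank proper-noun tags (NNP and NNPS). The helper name extract_proper_nouns is only illustrative, and the list is copied from the output above so the snippet runs without calling NLTK again.

```python
# Tagged output copied verbatim from the run above.
pos_tagged_text = [
    ('This', 'DT'), ('is', 'VBZ'), ('My', 'PRP$'), ('text', 'NN'), ('.', '.'),
    ('Unicorns', 'NNS'), ('are', 'VBP'), ('very', 'RB'), ('Nice', 'NNP'),
    (',', ','), ('I', 'PRP'), ('think', 'VBP'), ('.', '.'), ('how', 'WRB'),
    ('do', 'VB'), ('you', 'PRP'), ('do', 'VB'), ('?', '.'), ('are', 'VBP'),
    ('you', 'PRP'), ('okay', 'JJ'), ('!', '.'), ('testing', 'VBG'),
    ('capitalisation', 'NN'), ('.', '.'), ('my', 'PRP$'), ('nice', 'JJ'),
    ('Friend', 'NNP'), ('is', 'VBZ'), ('called', 'VBN'), ('bob', 'NN'),
    ('he', 'PRP'), ('lives', 'VBZ'), ('in', 'IN'), ('america', 'NN'),
    ('.', '.'),
]

# Penn Treebank tags for singular and plural proper nouns
PROPER_NOUN_TAGS = {"NNP", "NNPS"}


def extract_proper_nouns(tagged):
    """Return the tokens whose tag marks them as proper nouns (hypothetical helper)."""
    return [word for word, tag in tagged if tag in PROPER_NOUN_TAGS]


print(extract_proper_nouns(pos_tagged_text))
```

With the output above, this picks up only "Nice" and "Friend"; "bob" and "america" are missed entirely.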

How can I find proper nouns regardless of capitalisation?

