Reputation:
I'm writing a program that fixes faulty capitalisation, and I'm trying to find proper nouns in Python using NLTK's POS tagger. The problem is that the tagger doesn't work very well on text with faulty or missing capitalisation.
This is the code I have so far:
import nltk
text = "This is My text. Unicorns are very Nice, I think. how do you do? are you okay! testing capitalisation. my nice Friend is called bob he lives in america."
tokenized_words = nltk.word_tokenize(text)
pos_tagged_text = nltk.pos_tag(tokenized_words)
print(pos_tagged_text)
And the output is:
[('This', 'DT'), ('is', 'VBZ'), ('My', 'PRP$'), ('text', 'NN'), ('.', '.'), ('Unicorns', 'NNS'), ('are', 'VBP'), ('very', 'RB'), ('Nice', 'NNP'), (',', ','), ('I', 'PRP'), ('think', 'VBP'), ('.', '.'), ('how', 'WRB'), ('do', 'VB'), ('you', 'PRP'), ('do', 'VB'), ('?', '.'), ('are', 'VBP'), ('you', 'PRP'), ('okay', 'JJ'), ('!', '.'), ('testing', 'VBG'), ('capitalisation', 'NN'), ('.', '.'), ('my', 'PRP$'), ('nice', 'JJ'), ('Friend', 'NNP'), ('is', 'VBZ'), ('called', 'VBN'), ('bob', 'NN'), ('he', 'PRP'), ('lives', 'VBZ'), ('in', 'IN'), ('america', 'NN'), ('.', '.')]
As you can see, there are quite a few mistakes. "Nice" and "Friend" are tagged as proper nouns (NNP), while "bob" and "america" are tagged as common nouns (NN).
How can I find proper nouns regardless of capitalisation?
Upvotes: 1
Views: 945
Reputation: 2126
I recommend the Python library spaCy; its models have great accuracy for part-of-speech tagging. If the casing of the original text isn't reliable, I suggest lower-casing the entire text before tagging, so that stray capitals can't push the tagger towards false positives.
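import spacy
# The model must be installed first: python -m spacy download en_core_web_lg
nlp = spacy.load('en_core_web_lg')
text = "This is My text. Unicorns are very Nice, I think. how do you do? are you okay! testing capitalisation. my nice Friend is called bob he lives in america."
# Lower-case the input so that unreliable capitals don't influence the tagger
doc = nlp(text.lower())
print([tok for tok in doc if tok.pos_ == 'PROPN'])  # extract all proper nouns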
Output:
[bob, america]
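Once the proper nouns are known, the capitalisation fix itself could look roughly like the sketch below. This is only an illustration, not a complete solution: `fix_capitalisation` is a hypothetical helper, the (word, tag) pairs are hand-written rather than produced by a model, and the rule set (capitalise proper nouns, sentence starts and the pronoun "I", lower-case everything else) is an assumption about what the program should do.

```python
def fix_capitalisation(tagged_tokens, proper_tags=("PROPN", "NNP", "NNPS")):
    """Hypothetical helper: rebuild casing from (word, tag) pairs.

    Every word is lower-cased first, then capitalised again if it is
    tagged as a proper noun, starts a sentence, or is the pronoun "I".
    """
    sentence_end = {".", "?", "!"}
    fixed = []
    start_of_sentence = True
    for word, tag in tagged_tokens:
        lowered = word.lower()
        if tag in proper_tags or start_of_sentence or lowered == "i":
            lowered = lowered[:1].upper() + lowered[1:]
        fixed.append(lowered)
        # The token after sentence-final punctuation starts a new sentence.
        start_of_sentence = word in sentence_end
    return fixed


# Hand-written tags for illustration only (assumed, not model output).
tagged = [
    ("my", "PRON"), ("nice", "ADJ"), ("Friend", "NOUN"), ("is", "AUX"),
    ("called", "VERB"), ("bob", "PROPN"), (".", "PUNCT"),
    ("he", "PRON"), ("lives", "VERB"), ("in", "ADP"),
    ("america", "PROPN"), (".", "PUNCT"),
]
print(fix_capitalisation(tagged))
# ['My', 'nice', 'friend', 'is', 'called', 'Bob', '.', 'He', 'lives', 'in', 'America', '.']
```

With spaCy, the pairs could be built as [(tok.text, tok.pos_) for tok in doc]. Joining the tokens back into a string with correct spacing around punctuation is left out here.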
Upvotes: 2