user13987160

How to find uncapitalised proper nouns with NLTK?

I'm writing a 'fix faulty capitalisation' program, and I'm trying to find proper nouns in Python using NLTK's POS tagger. The problem is that the tagger doesn't work very well on text with faulty or missing capitalisation.

This is the code I have so far:

import nltk

text = "This is My text. Unicorns are very Nice, I think. how do you do? are you okay! testing capitalisation. my nice Friend is called bob he lives in america."

tokenized_words = nltk.word_tokenize(text)
pos_tagged_text = nltk.pos_tag(tokenized_words)
print(pos_tagged_text)

And the output is:

[('This', 'DT'), ('is', 'VBZ'), ('My', 'PRP$'), ('text', 'NN'), ('.', '.'), ('Unicorns', 'NNS'), ('are', 'VBP'), ('very', 'RB'), ('Nice', 'NNP'), (',', ','), ('I', 'PRP'), ('think', 'VBP'), ('.', '.'), ('how', 'WRB'), ('do', 'VB'), ('you', 'PRP'), ('do', 'VB'), ('?', '.'), ('are', 'VBP'), ('you', 'PRP'), ('okay', 'JJ'), ('!', '.'), ('testing', 'VBG'), ('capitalisation', 'NN'), ('.', '.'), ('my', 'PRP$'), ('nice', 'JJ'), ('Friend', 'NNP'), ('is', 'VBZ'), ('called', 'VBN'), ('bob', 'NN'), ('he', 'PRP'), ('lives', 'VBZ'), ('in', 'IN'), ('america', 'NN'), ('.', '.')]

As you can see, there are quite a few mistakes. "Nice" and "Friend" get tagged as proper nouns (NNP), while "bob" and "america" are tagged as common nouns (NN).

How can I find proper nouns regardless of capitalisation?

Upvotes: 1

Views: 945

Answers (1)

thorntonc

Reputation: 2126

I recommend using the Python library spaCy; its models have great accuracy for part-of-speech tagging. If the casing of the original text isn't reliable, I suggest lower-casing the entire text before tagging, so that stray capitals such as "Nice" and "Friend" can't be mistaken for proper nouns.

import spacy

nlp = spacy.load('en_core_web_lg')

text = "This is My text. Unicorns are very Nice, I think. how do you do? are you okay! testing capitalisation. my nice Friend is called bob he lives in america."
doc = nlp(text.lower())
print([tok for tok in doc if tok.pos_=='PROPN'])  # extract all proper nouns

Output:

[bob, america]
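
Since the goal is to fix the capitalisation, here is a minimal sketch of how the detected proper nouns could be written back into the text. The function name `recapitalise` and the `(start, end)` span format are my own assumptions for illustration, not part of spaCy or NLTK. It also assumes that lower-casing doesn't change the length of the text (true for plain ASCII), so that offsets computed on `text.lower()` still line up.

```python
import re


def recapitalise(text, proper_noun_spans):
    """Lower-case text, then capitalise proper nouns, sentence starts and "I".

    proper_noun_spans is a hypothetical input format: an iterable of
    (start, end) character offsets marking each proper noun in text.
    """
    chars = list(text.lower())

    # Capitalise the first character of every proper noun.
    for start, _end in proper_noun_spans:
        chars[start] = chars[start].upper()

    # Capitalise the first letter of the text and the first letter
    # that follows sentence-ending punctuation.
    capitalise_next = True
    for i, ch in enumerate(chars):
        if ch.isalpha():
            if capitalise_next:
                chars[i] = ch.upper()
                capitalise_next = False
        elif ch in ".?!":
            capitalise_next = True

    # Restore the pronoun "I" when it stands alone as a word.
    return re.sub(r"\bi\b", "I", "".join(chars))


if __name__ == "__main__":
    sample = "my nice Friend is called bob he lives in america."
    spans = [(sample.index(w), sample.index(w) + len(w)) for w in ("bob", "america")]
    print(recapitalise(sample, spans))
```

With spaCy, the spans could be built from the `doc` above as `[(tok.idx, tok.idx + len(tok)) for tok in doc if tok.pos_ == 'PROPN']`, where `tok.idx` is the token's character offset. The sentence-start rule here is deliberately naive (it treats every ".", "?" and "!" as a sentence end), so abbreviations would need extra handling.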

Upvotes: 2
