Reputation:
Can someone please explain the behaviour of NLTK's BigramTagger in these examples?
I instantiated the tagger by
bi= BigramTagger(brown.tagged_sents(categories='news')[:500])
Now, I want to use this on one specific sentence.
>>> bi.tag(brown_sents[2])
[(u'The', u'AT'), (u'September-October', u'NP'), (u'term', u'NN'), (u'jury', u'NN'), (u'had', u'HVD'), (u'been', u'BEN'), (u'charged', u'VBN'), (u'by', u'IN'), (u'Fulton', u'NP-TL'), (u'Superior', u'JJ-TL'), (u'Court', u'NN-TL'), (u'Judge', u'NN-TL'), (u'Durwood', u'NP'), (u'Pye', u'NP'), (u'to', u'TO'), (u'investigate', u'VB'), (u'reports', u'NNS'), (u'of', u'IN'), (u'possible', u'JJ'), (u'``', u'``'), (u'irregularities', u'NNS'), (u"''", u"''"), (u'in', u'IN'), (u'the', u'AT'), (u'hard-fought', u'JJ'), (u'primary', u'NN'), (u'which', u'WDT'), (u'was', u'BEDZ'), (u'won', u'VBN'), (u'by', u'IN'), (u'Mayor-nominate', u'NN-TL'), (u'Ivan', u'NP'), (u'Allen', u'NP'), (u'Jr.', u'NP'), (u'.', u'.')]
Works well, but hey, it's all known data. Let me change one word and see if it sets something off.
>>> sent=brown_sents[2]
>>> sent[5]
u'been'
>>> sent[5] = u'was'
>>> bi.tag(sent)
[(u'The', u'AT'), (u'September-October', u'NP'), (u'term', u'NN'), (u'jury', u'NN'), (u'had', u'HVD'), (u'was', None), (u'charged', None), (u'by', None), (u'Fulton', None), (u'Superior', None), (u'Court', None), (u'Judge', None), (u'Durwood', None), (u'Pye', None), (u'to', None), (u'investigate', None), (u'reports', None), (u'of', None), (u'possible', None), (u'``', None), (u'irregularities', None), (u"''", None), (u'in', None), (u'the', None), (u'hard-fought', None), (u'primary', None), (u'which', None), (u'was', None), (u'won', None), (u'by', None), (u'Mayor-nominate', None), (u'Ivan', None), (u'Allen', None), (u'Jr.', None), (u'.', None)]
Now I expected to see changed tuple, (u'been', u'BEN')
to now be (u'been', None). Why is everything after it in the sentence now not tagged? Those words were tagged in connection to another ones, not 'been'.
Any recommendation on use of tagged sentences would be appreciated as well.
Upvotes: 3
Views: 1612
Reputation: 122280
You have to set a backoff tagger when using *gramTagger so that if the specific ngram is not seen in the training data, it will backoff to a tagger trained on a lower order ngram. See "Combining Taggers" section in http://www.nltk.org/book/ch05.html
>>> from nltk import DefaultTagger, UnigramTagger, BigramTagger
>>> from nltk.corpus import brown
>>> text = brown.tagged_sents(categories='news')[:500]
>>> t0 = DefaultTagger('NN')
>>> t1 = UnigramTagger(text, backoff=t0)
>>> t2 = BigramTagger(text, backoff=t1)
>>> test_sent = brown.sents()[502]
>>> test_sent
[u'Noting', u'that', u'Plainfield', u'last', u'year', u'had', u'lost', u'the', u'Mack', u'Truck', u'Co.', u'plant', u',', u'he', u'said', u'industry', u'will', u'not', u'come', u'into', u'this', u'state', u'until', u'there', u'is', u'tax', u'reform', u'.']
>>> t2.tag(test_sent)
[(u'Noting', u'VBG'), (u'that', u'CS'), (u'Plainfield', u'NP-HL'), (u'last', u'AP'), (u'year', u'NN'), (u'had', u'HVD'), (u'lost', u'VBD'), (u'the', u'AT'), (u'Mack', 'NN'), (u'Truck', 'NN'), (u'Co.', u'NN-TL'), (u'plant', 'NN'), (u',', u','), (u'he', u'PPS'), (u'said', u'VBD'), (u'industry', 'NN'), (u'will', u'MD'), (u'not', u'*'), (u'come', u'VB'), (u'into', u'IN'), (u'this', u'DT'), (u'state', u'NN'), (u'until', 'NN'), (u'there', u'EX'), (u'is', u'BEZ'), (u'tax', 'NN'), (u'reform', 'NN'), (u'.', u'.')]
And to show that it works with your example in the question ;P
>>> test_sent = brown.sents()[2]
>>> test_sent
[u'The', u'September-October', u'term', u'jury', u'had', u'been', u'charged', u'by', u'Fulton', u'Superior', u'Court', u'Judge', u'Durwood', u'Pye', u'to', u'investigate', u'reports', u'of', u'possible', u'``', u'irregularities', u"''", u'in', u'the', u'hard-fought', u'primary', u'which', u'was', u'won', u'by', u'Mayor-nominate', u'Ivan', u'Allen', u'Jr.', u'.']
>>> t2.tag(test_sent)
[(u'The', u'AT'), (u'September-October', u'NP'), (u'term', 'NN'), (u'jury', u'NN'), (u'had', u'HVD'), (u'been', u'BEN'), (u'charged', u'VBN'), (u'by', u'IN'), (u'Fulton', u'NP-TL'), (u'Superior', u'JJ-TL'), (u'Court', u'NN-TL'), (u'Judge', u'NN-TL'), (u'Durwood', u'NP'), (u'Pye', u'NP'), (u'to', u'TO'), (u'investigate', u'VB'), (u'reports', u'NNS'), (u'of', u'IN'), (u'possible', u'JJ'), (u'``', u'``'), (u'irregularities', u'NNS'), (u"''", u"''"), (u'in', u'IN'), (u'the', u'AT'), (u'hard-fought', u'JJ'), (u'primary', 'NN'), (u'which', u'WDT'), (u'was', u'BEDZ'), (u'won', u'VBN'), (u'by', u'IN'), (u'Mayor-nominate', u'NN-TL'), (u'Ivan', u'NP'), (u'Allen', u'NP'), (u'Jr.', u'NP'), (u'.', u'.')]
At some point, you might realize that Python NLTK pos_tag not returning the correct part-of-speech tag
Upvotes: 4