NLTK BigramTagger does not tag half of the sentence

Question

Can someone please explain the behaviour of NLTK's BigramTagger in these examples?

I instantiated the tagger by

bi= BigramTagger(brown.tagged_sents(categories='news')[:500])

Now, I want to use this on one specific sentence.

>>> bi.tag(brown_sents[2])
[(u'The', u'AT'), (u'September-October', u'NP'), (u'term', u'NN'), (u'jury', u'NN'), (u'had', u'HVD'), (u'been', u'BEN'), (u'charged', u'VBN'), (u'by', u'IN'), (u'Fulton', u'NP-TL'), (u'Superior', u'JJ-TL'), (u'Court', u'NN-TL'), (u'Judge', u'NN-TL'), (u'Durwood', u'NP'), (u'Pye', u'NP'), (u'to', u'TO'), (u'investigate', u'VB'), (u'reports', u'NNS'), (u'of', u'IN'), (u'possible', u'JJ'), (u'``', u'``'), (u'irregularities', u'NNS'), (u"''", u"''"), (u'in', u'IN'), (u'the', u'AT'), (u'hard-fought', u'JJ'), (u'primary', u'NN'), (u'which', u'WDT'), (u'was', u'BEDZ'), (u'won', u'VBN'), (u'by', u'IN'), (u'Mayor-nominate', u'NN-TL'), (u'Ivan', u'NP'), (u'Allen', u'NP'), (u'Jr.', u'NP'), (u'.', u'.')]

Works well, but hey, it's all known data. Let me change one word and see if it sets something off.

>>> sent=brown_sents[2]
>>> sent[5]
u'been'
>>> sent[5] = u'was'
>>> bi.tag(sent)
[(u'The', u'AT'), (u'September-October', u'NP'), (u'term', u'NN'), (u'jury', u'NN'), (u'had', u'HVD'), (u'was', None), (u'charged', None), (u'by', None), (u'Fulton', None), (u'Superior', None), (u'Court', None), (u'Judge', None), (u'Durwood', None), (u'Pye', None), (u'to', None), (u'investigate', None), (u'reports', None), (u'of', None), (u'possible', None), (u'``', None), (u'irregularities', None), (u"''", None), (u'in', None), (u'the', None), (u'hard-fought', None), (u'primary', None), (u'which', None), (u'was', None), (u'won', None), (u'by', None), (u'Mayor-nominate', None), (u'Ivan', None), (u'Allen', None), (u'Jr.', None), (u'.', None)]

Now I expected to see changed tuple, (u'been', u'BEN') to now be (u'been', None). Why is everything after it in the sentence now not tagged? Those words were tagged in connection to another ones, not 'been'.

Any recommendation on use of tagged sentences would be appreciated as well.

alvas · Accepted Answer

You have to set a backoff tagger when using *gramTagger so that if the specific ngram is not seen in the training data, it will backoff to a tagger trained on a lower order ngram. See "Combining Taggers" section in http://www.nltk.org/book/ch05.html

>>> from nltk import DefaultTagger, UnigramTagger, BigramTagger
>>> from nltk.corpus import brown
>>> text = brown.tagged_sents(categories='news')[:500]
>>> t0 = DefaultTagger('NN')
>>> t1 = UnigramTagger(text, backoff=t0)
>>> t2 = BigramTagger(text, backoff=t1)

>>> test_sent = brown.sents()[502]
>>> test_sent
[u'Noting', u'that', u'Plainfield', u'last', u'year', u'had', u'lost', u'the', u'Mack', u'Truck', u'Co.', u'plant', u',', u'he', u'said', u'industry', u'will', u'not', u'come', u'into', u'this', u'state', u'until', u'there', u'is', u'tax', u'reform', u'.']
>>> t2.tag(test_sent)
[(u'Noting', u'VBG'), (u'that', u'CS'), (u'Plainfield', u'NP-HL'), (u'last', u'AP'), (u'year', u'NN'), (u'had', u'HVD'), (u'lost', u'VBD'), (u'the', u'AT'), (u'Mack', 'NN'), (u'Truck', 'NN'), (u'Co.', u'NN-TL'), (u'plant', 'NN'), (u',', u','), (u'he', u'PPS'), (u'said', u'VBD'), (u'industry', 'NN'), (u'will', u'MD'), (u'not', u'*'), (u'come', u'VB'), (u'into', u'IN'), (u'this', u'DT'), (u'state', u'NN'), (u'until', 'NN'), (u'there', u'EX'), (u'is', u'BEZ'), (u'tax', 'NN'), (u'reform', 'NN'), (u'.', u'.')]

And to show that it works with your example in the question ;P

>>> test_sent = brown.sents()[2]
>>> test_sent
[u'The', u'September-October', u'term', u'jury', u'had', u'been', u'charged', u'by', u'Fulton', u'Superior', u'Court', u'Judge', u'Durwood', u'Pye', u'to', u'investigate', u'reports', u'of', u'possible', u'``', u'irregularities', u"''", u'in', u'the', u'hard-fought', u'primary', u'which', u'was', u'won', u'by', u'Mayor-nominate', u'Ivan', u'Allen', u'Jr.', u'.']
>>> t2.tag(test_sent)
[(u'The', u'AT'), (u'September-October', u'NP'), (u'term', 'NN'), (u'jury', u'NN'), (u'had', u'HVD'), (u'been', u'BEN'), (u'charged', u'VBN'), (u'by', u'IN'), (u'Fulton', u'NP-TL'), (u'Superior', u'JJ-TL'), (u'Court', u'NN-TL'), (u'Judge', u'NN-TL'), (u'Durwood', u'NP'), (u'Pye', u'NP'), (u'to', u'TO'), (u'investigate', u'VB'), (u'reports', u'NNS'), (u'of', u'IN'), (u'possible', u'JJ'), (u'``', u'``'), (u'irregularities', u'NNS'), (u"''", u"''"), (u'in', u'IN'), (u'the', u'AT'), (u'hard-fought', u'JJ'), (u'primary', 'NN'), (u'which', u'WDT'), (u'was', u'BEDZ'), (u'won', u'VBN'), (u'by', u'IN'), (u'Mayor-nominate', u'NN-TL'), (u'Ivan', u'NP'), (u'Allen', u'NP'), (u'Jr.', u'NP'), (u'.', u'.')]

At some point, you might realize that Python NLTK pos_tag not returning the correct part-of-speech tag

NLTK BigramTagger does not tag half of the sentence

Answers (1)

Related Questions