user1946217

Reputation: 1753

Backoff Tagger in nltk

I am new to Python. I want to use the UnigramTagger with a backoff tagger (in my case a RegexpTagger), and I have been struggling to figure out what the error below means. I'd appreciate any help on this.

>>> train_sents = (['@Sakshi', 'Hi', 'I', 'am', 'meeting', 'my', 'friend', 'today'])    
>>> from tag_util import patterns  
>>> from nltk.tag import RegexpTagger  
>>> re_tagger = RegexpTagger(patterns)  
>>> from nltk.tag import UnigramTagger  
>>> from tag_util import backoff_tagger  
>>> tagger = backoff_tagger(train_sents, UnigramTagger, backoff=re_tagger)

Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    tagger = backoff_tagger(train_sents, UnigramTagger, backoff=re_tagger)
  File "tag_util.py", line 12, in backoff_tagger
    for cls in tagger_classes:
TypeError: 'YAMLObjectMetaclass' object is not iterable

This is the code I have in tag_util.py for patterns and backoff_tagger:

import re  
patterns = [  
    (r'^@\w+', 'NNP'),  
    (r'^\d+$', 'CD'),  
    (r'.*ing$', 'VBG'), # gerunds, i.e. wondering  
    (r'.*ment$', 'NN'),  
    (r'.*ful$', 'JJ'), # i.e. wonderful  
    (r'.*', 'NN')  
]  

def backoff_tagger(train_sents, tagger_classes, backoff=None):
    for cls in tagger_classes:
        backoff = cls(train_sents, backoff=backoff)
    return backoff

Upvotes: 3

Views: 3407

Answers (2)

Paulo

Reputation: 1

If you are using the backoff_tagger I am thinking of, UnigramTagger should be passed as an item of a list, as below:

tagger = backoff_tagger(train_sents, [UnigramTagger], backoff=re_tagger)
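To see why the list matters, here is a minimal sketch that needs no NLTK install. StandInTagger is a hypothetical placeholder with the same constructor shape as a tagger class, not a real NLTK class; backoff_tagger is copied from the question. With a plain Python 3 class the message names 'type' rather than 'YAMLObjectMetaclass', but the cause is the same: the for loop is handed a class instead of a sequence of classes.

```python
class StandInTagger:
    """Hypothetical stand-in for a tagger class such as UnigramTagger."""

    def __init__(self, train_sents, backoff=None):
        self.train_sents = train_sents
        self.backoff = backoff


def backoff_tagger(train_sents, tagger_classes, backoff=None):
    # Same helper as in the question: build a chain, one class at a time.
    for cls in tagger_classes:
        backoff = cls(train_sents, backoff=backoff)
    return backoff


# Passing the bare class fails, because a class is not iterable.
try:
    backoff_tagger([], StandInTagger)
    error = None
except TypeError as exc:
    error = str(exc)
print(error)

# Wrapping the class in a list gives the loop something to iterate over.
tagger = backoff_tagger([], [StandInTagger], backoff='regex-fallback')
print(type(tagger).__name__, tagger.backoff)
```

The string 'regex-fallback' only stands in for the real RegexpTagger instance so the chain is visible when printed.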

I hope it helps.

Upvotes: 0

Jared

Reputation: 26407

You only need to change a few things for this to work.

The error occurs because backoff_tagger tries to iterate over its tagger_classes argument, and you passed the UnigramTagger class itself, which is not iterable. I'm not sure if you had something else in mind, but you can simply remove the for loop. Also, you need to pass UnigramTagger a list of tagged sentences, each represented as a list of (word, tag) tuples, not just a list of words. Otherwise, it has nothing to train on. The training data might look like:

[[('@Sakshi', 'NN'), ('Hi', 'NN'),...],...[('Another', 'NN'), ('sentence', 'NN')]]

Notice here that each sentence is itself a list. Also, you can use a tagged corpus from NLTK for this (which I recommend).
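As a quick sanity check on the shape, something like the following sketch can help. looks_like_tagged_sents is a hypothetical helper written for this answer, not part of NLTK, and the tags in the second example are only illustrative.

```python
def looks_like_tagged_sents(data):
    """Return True if data is a list of sentences, each a list of (word, tag) pairs."""
    return isinstance(data, list) and all(
        isinstance(sent, list)
        and all(isinstance(tok, tuple) and len(tok) == 2 for tok in sent)
        for sent in data
    )


# The question's input: a flat list of words, with no tags and no sentence nesting.
flat_words = ['@Sakshi', 'Hi', 'I', 'am', 'meeting', 'my', 'friend', 'today']

# The shape a tagger expects for training: sentences of (word, tag) tuples.
tagged = [
    [('@Sakshi', 'NNP'), ('Hi', 'UH')],
    [('Another', 'DT'), ('sentence', 'NN')],
]

print(looks_like_tagged_sents(flat_words))  # False
print(looks_like_tagged_sents(tagged))      # True
```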

Edit:

After reading your post it seems to me that you're both confused about what input/output to expect from certain functions and lacking an understanding of training in the NLP sense. I think you would greatly benefit from reading the NLTK book, starting at the beginning.

I'm glad to show you how to fix this, but I don't think you'll fully understand the underlying mechanisms without some more reading.

tag_util.py (based on your code)

from nltk.tag import RegexpTagger, UnigramTagger
from nltk.corpus import brown

patterns = [
    (r'^@\w+', 'NNP'),
    (r'^\d+$', 'CD'),
    (r'.*ing$', 'VBG'),
    (r'.*ment$', 'NN'),
    (r'.*ful$', 'JJ'),
    (r'.*', 'NN')
]
re_tagger = RegexpTagger(patterns)
tagger = UnigramTagger(brown.tagged_sents(), backoff=re_tagger) # train tagger

In the Python interpreter

>>> import tag_util
>>> tag_util.brown.tagged_sents()[:2]
[[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')], [('The', 'AT'), ('jury', 'NN'), ('further', 'RBR'), ('said', 'VBD'), ('in', 'IN'), ('term-end', 'NN'), ('presentments', 'NNS'), ('that', 'CS'), ('the', 'AT'), ('City', 'NN-TL'), ('Executive', 'JJ-TL'), ('Committee', 'NN-TL'), (',', ','), ('which', 'WDT'), ('had', 'HVD'), ('over-all', 'JJ'), ('charge', 'NN'), ('of', 'IN'), ('the', 'AT'), ('election', 'NN'), (',', ','), ('``', '``'), ('deserves', 'VBZ'), ('the', 'AT'), ('praise', 'NN'), ('and', 'CC'), ('thanks', 'NNS'), ('of', 'IN'), ('the', 'AT'), ('City', 'NN-TL'), ('of', 'IN-TL'), ('Atlanta', 'NP-TL'), ("''", "''"), ('for', 'IN'), ('the', 'AT'), ('manner', 'NN'), ('in', 'IN'), ('which', 'WDT'), ('the', 'AT'), ('election', 'NN'), ('was', 'BEDZ'), ('conducted', 'VBN'), ('.', '.')]]

Notice the output here. I am getting the first two sentences from the Brown corpus of tagged sentences. This is the kind of data you need to pass to a tagger such as the UnigramTagger in order to train it. Now let's use the tagger we trained in tag_util.py.

Back to the Python interpreter

>>> tag_util.tagger.tag(['I', 'just', 'drank', 'some', 'coffee', '.'])
[('I', 'PPSS'), ('just', 'RB'), ('drank', 'VBD'), ('some', 'DTI'), ('coffee', 'NN'), ('.', '.')]

And there you have it: the POS-tagged words of a sentence, using your approach.
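If it helps to see the idea behind a unigram tagger with a regex backoff, here is a rough pure-Python sketch. It is not NLTK's implementation: train_unigram, regex_tag and tag_with_backoff are hypothetical names, and the toy training data and its tags are made up for illustration. The model remembers the most frequent tag for each word seen in training and falls back to the regex patterns for everything else.

```python
import re
from collections import Counter, defaultdict

# Toy hand-tagged training data (illustrative tags only).
tagged_sents = [
    [('I', 'PRP'), ('am', 'VBP'), ('meeting', 'VBG'), ('my', 'PRP$'), ('friend', 'NN')],
    [('my', 'PRP$'), ('friend', 'NN'), ('is', 'VBZ'), ('here', 'RB')],
]

# A subset of the patterns from the question, tried in order.
patterns = [
    (r'^@\w+', 'NNP'),
    (r'^\d+$', 'CD'),
    (r'.*ing$', 'VBG'),
    (r'.*', 'NN'),
]


def train_unigram(sents):
    """Map each training word to the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for sent in sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def regex_tag(word):
    """Return the tag of the first pattern that matches the word."""
    for pattern, tag in patterns:
        if re.match(pattern, word):
            return tag
    return None


def tag_with_backoff(model, words):
    """Use the trained lookup first, then back off to the regex patterns."""
    return [(w, model.get(w) or regex_tag(w)) for w in words]


model = train_unigram(tagged_sents)
result = tag_with_backoff(model, ['@Sakshi', 'my', 'friend', 'running', '42'])
print(result)
```

Here 'my' and 'friend' come from the trained lookup, while '@Sakshi', 'running' and '42' were never seen in training and are tagged by the patterns.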

Upvotes: 2
