Reputation: 1
I am using the combined tagger described in the NLTK book, chapter 5. Here is the code:
import nltk

# train_sents is a list of tagged training sentences,
# e.g. brown.tagged_sents(categories='news') as in the book
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
Since the default tagger tags every token as NN, every unknown token that falls back to t0 will be tagged NN, regardless of context.
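For example, assuming train_sents comes from the Brown corpus as in the book (so a word like blog never appears in the training data), t2 tags it NN no matter what surrounds it:

# 'blog' is unknown to both the bigram and unigram taggers, so it backs
# off to the default tagger and gets 'NN' in either context
t2.tag(['I', 'like', 'to', 'blog'])
t2.tag(['I', 'read', 'the', 'blog'])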
The book says this can be resolved with the following method:
Our approach to tagging unknown words still uses backoff to a regular-expression tagger or a default tagger. These are unable to make use of context. Thus, if our tagger encountered the word blog, not seen during training, it would assign it the same tag, regardless of whether this word appeared in the context the blog or to blog. How can we do better with these unknown words, or out-of-vocabulary items?
A useful method to tag unknown words based on context is to limit the vocabulary of a tagger to the most frequent n words, and to replace every other word with a special word UNK using the method shown in 3. During training, a unigram tagger will probably learn that UNK is usually a noun. However, the n-gram taggers will detect contexts in which it has some other tag. For example, if the preceding word is to (tagged TO), then UNK will probably be tagged as a verb.
I have written the method shown in 3, which maps every word outside the 1000 most frequent words to UNK:
>>> alice = nltk.corpus.gutenberg.words('carroll-alice.txt')
>>> vocab = nltk.FreqDist(alice)
>>> v1000 = [word for (word, _) in vocab.most_common(1000)]
>>> from collections import defaultdict
>>> mapping = defaultdict(lambda: 'UNK')
>>> for v in v1000:
...     mapping[v] = v
...
>>> alice2 = [mapping[v] for v in alice]
>>> alice2[:100]
['UNK', 'Alice', "'", 's', 'UNK', 'in', 'UNK', 'by', 'UNK', 'UNK', 'UNK',
'UNK', 'CHAPTER', 'I', '.', 'UNK', 'the', 'Rabbit', '-', 'UNK', 'Alice',
'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by',
'her', 'sister', 'on', 'the', 'UNK', ',', 'and', 'of', 'having', 'nothing',
'to', 'do', ':', 'once', 'or', 'twice', 'she', 'had', 'UNK', 'into', 'the',
'book', 'her', 'sister', 'was', 'UNK', ',', 'but', 'it', 'had', 'no',
'pictures', 'or', 'UNK', 'in', 'it', ',', "'", 'and', 'what', 'is', 'the',
'use', 'of', 'a', 'book', ",'", 'thought', 'Alice', "'", 'without',
'pictures', 'or', 'conversation', "?'" ...]
>>> len(set(alice2))
1001
My question is: how do we implement this method in the combined taggers? Where do I put the new mapped dictionary (in this example, mapping) when building the combined taggers?
Upvotes: 0
Views: 1774
Reputation: 189317
You should be replacing the tags, not the words themselves. Based on the code you shared, something like
mapped_unk = [(w[0], 'UNK') if i % 2 else (w[0], w[1]) for i, w in enumerate(tagged)]
where I assume tagged is a tagged version of alice, such that each input word has been mapped to a tuple (word, tag), like
[('Alice', 'NN'), ('tagged', 'VT'), ('her', 'PRON'), ('corpus', 'N'), ('.', 'PUNC')]
(with some guesswork around your tagging conventions).
nltk has routines for tagging a piece of text, but it's better if you can find a human-vetted tagged text to train on; the included corpora feature a number of such texts. The tagged training set should obviously use the tagging conventions (tag set, tokenization, etc.) you want the trained tagger to learn; clearly, the availability of tagged corpora to train on will constrain your choices if you can't manually produce a large enough tagged corpus for your preferred conventions.
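For instance, a rough sketch of both routes (the Brown 'news' section and the 90/10 split are just illustrative choices; nltk.pos_tag uses the Penn Treebank tag set, which may not match the conventions in your own data):

import nltk
from nltk.corpus import brown, gutenberg
# may first require nltk.download('brown'), nltk.download('gutenberg'),
# and nltk.download('averaged_perceptron_tagger')

# Route 1: tag raw text with nltk's built-in tagger (Penn Treebank tags)
alice = list(gutenberg.words('carroll-alice.txt'))
tagged = nltk.pos_tag(alice)              # list of (word, tag) tuples

# Route 2: train on a human-vetted tagged corpus bundled with nltk
tagged_sents = list(brown.tagged_sents(categories='news'))
size = int(len(tagged_sents) * 0.9)
train_sents, test_sents = tagged_sents[:size], tagged_sents[size:]
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
print(t2.evaluate(test_sents))            # accuracy on held-out tagged data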
Upvotes: 1