hasindu-s

Reputation: 1

How to tag unknown words (Tokens with tag UNK) in combined taggers

I am using the combined tagger described in the nltk book - chapter 5

Here is the code

import nltk

# train_sents is the list of tagged training sentences, as in the book
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)

Since the default tagger tags every token as NN, every token that falls through to t0 will be tagged NN. The book says this can be addressed with the following method:

Our approach to tagging unknown words still uses backoff to a regular-expression tagger or a default tagger. These are unable to make use of context. Thus, if our tagger encountered the word blog, not seen during training, it would assign it the same tag, regardless of whether this word appeared in the context the blog or to blog. How can we do better with these unknown words, or out-of-vocabulary items?

A useful method to tag unknown words based on context is to limit the vocabulary of a tagger to the most frequent n words, and to replace every other word with a special word UNK using the method shown in 3. During training, a unigram tagger will probably learn that UNK is usually a noun. However, the n-gram taggers will detect contexts in which it has some other tag. For example, if the preceding word is to (tagged TO), then UNK will probably be tagged as a verb.

I have written out the method shown in 3, which maps every word outside the 1000 most frequent ones to UNK:

>>> import nltk
>>> from collections import defaultdict
>>> alice = nltk.corpus.gutenberg.words('carroll-alice.txt')
>>> vocab = nltk.FreqDist(alice)
>>> v1000 = [word for (word, _) in vocab.most_common(1000)]
>>> mapping = defaultdict(lambda: 'UNK')
>>> for v in v1000:
...     mapping[v] = v
...
>>> alice2 = [mapping[v] for v in alice]
>>> alice2[:100]
['UNK', 'Alice', "'", 's', 'UNK', 'in', 'UNK', 'by', 'UNK', 'UNK', 'UNK',
'UNK', 'CHAPTER', 'I', '.', 'UNK', 'the', 'Rabbit', '-', 'UNK', 'Alice',
'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by',
'her', 'sister', 'on', 'the', 'UNK', ',', 'and', 'of', 'having', 'nothing',
'to', 'do', ':', 'once', 'or', 'twice', 'she', 'had', 'UNK', 'into', 'the',
'book', 'her', 'sister', 'was', 'UNK', ',', 'but', 'it', 'had', 'no',
'pictures', 'or', 'UNK', 'in', 'it', ',', "'", 'and', 'what', 'is', 'the',
'use', 'of', 'a', 'book', ",'", 'thought', 'Alice', "'", 'without',
'pictures', 'or', 'conversation', "?'" ...]
>>> len(set(alice2))
1001

My question is: how do we apply this method with combined taggers? Where does the new mapping dictionary (mapping in this example) go in the combined tagger setup?

Upvotes: 0

Views: 1774

Answers (1)

tripleee

Reputation: 189317

You should be replacing the tags, not the words themselves. Based on the code you shared, something like

mapped_unk = [(w[0], 'UNK') if i%2 else (w[0], w[1]) for i, w in enumerate(tagged)]

where I assume tagged is a tagged version of alice such that each input word has been mapped to a tuple (word, tag) like [('Alice', 'NN'), ('tagged', 'VT'), ('her', 'PRON'), ('corpus', 'N'), ('.', 'PUNC')] (with some guesswork around your tagging conventions).

NLTK has routines for tagging a piece of text, but it's better if you can find a human-vetted tagged text to train on; the included corpora feature a number of such texts. The tagged training set should use the conventions (tag set, tokenization, etc.) you want the trained tagger to learn. If you can't manually produce a large enough tagged corpus for your preferred conventions, the availability of tagged corpora to train on will constrain your choices.
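If you would rather follow the book passage quoted in the question and replace rare words (rather than tags) before training, a minimal sketch could look like the following. It is plain Python with no NLTK calls. The helper names (build_vocab, map_words, map_tagged_sents), the toy sentences, and their Brown-style tags are all made up for illustration and are not taken from the book.

```python
from collections import Counter

UNK = "UNK"  # placeholder word for everything outside the vocabulary


def build_vocab(tagged_sents, n):
    """Return the set of the n most frequent words in the tagged sentences."""
    counts = Counter(word for sent in tagged_sents for (word, _tag) in sent)
    return {word for (word, _count) in counts.most_common(n)}


def map_words(words, vocab):
    """Replace out-of-vocabulary words with UNK (for text you want to tag)."""
    return [w if w in vocab else UNK for w in words]


def map_tagged_sents(tagged_sents, vocab):
    """Replace out-of-vocabulary words with UNK but keep the original tags."""
    return [[(w if w in vocab else UNK, t) for (w, t) in sent]
            for sent in tagged_sents]


# Hypothetical toy training data; a real run would use a tagged corpus.
train_sents = [
    [("the", "AT"), ("blog", "NN"), ("is", "BEZ"), ("new", "JJ")],
    [("she", "PPS"), ("wants", "VBZ"), ("to", "TO"), ("blog", "VB")],
    [("the", "AT"), ("cat", "NN"), ("wants", "VBZ"), ("to", "TO"), ("sleep", "VB")],
    [("the", "AT"), ("dog", "NN"), ("wants", "VBZ"), ("to", "TO"), ("run", "VB")],
]

vocab = build_vocab(train_sents, n=3)                   # keeps 'the', 'wants', 'to'
unk_train_sents = map_tagged_sents(train_sents, vocab)  # train the taggers on this
unk_tokens = map_words(["to", "vlog"], vocab)           # pass this to the tagger
```

Under these assumptions, unk_train_sents would take the place of train_sents when building t1 and t2 in the question's backoff chain, and any text to be tagged would go through map_words with the same vocab before calling t2.tag(...). The tagger then returns (UNK, tag) pairs for rare words, so you would pair the predicted tags back up with the original tokens afterwards.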

Upvotes: 1
