Reputation: 1560
What is the difference between part-of-speech tagging for unknown words and part-of-speech tagging for known words? Is there any tool that can predict part-of-speech tags for unknown words?
Upvotes: 0
Views: 1763
Reputation: 470
The TnT tagger paper presents an efficient approach to tagging unknown words, based on analysing word suffixes.
Another approach, which uses a lexicon to handle unknown words, can be found in this article. The article shows that the lexicon-based approach obtains promising tagging results for unknown words in comparison to TnT's on 13 languages: Bulgarian, Czech, Dutch, English, French, German, Hindi, Italian, Portuguese, Spanish, Swedish, Thai and Vietnamese. The article also reports accuracy results (for known words and for unknown words) of TnT and two other POS and morphological taggers on the 13 languages.
Upvotes: 0
Reputation: 2258
One common way of handling out-of-vocabulary words is to replace all low-frequency words (e.g., frequency < 3) in the training corpus with the token *RARE*, so that the tagger learns roughly how to tag rare words. Then, at test time, treat every word not in the tagger's vocabulary as *RARE*.
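A minimal sketch of the *RARE* replacement in plain Python, assuming the training data is a list of sentences of (word, tag) pairs. The toy corpus, the helper names (`replace_rare`, `train_most_frequent_tag`, `tag`) and the simple most-frequent-tag model are illustrative assumptions, not part of any particular tagger:

```python
from collections import Counter, defaultdict

RARE = "*RARE*"

def replace_rare(tagged_sents, min_count=3):
    """Replace every word seen fewer than min_count times with RARE."""
    counts = Counter(word for sent in tagged_sents for word, _ in sent)
    return [[(w if counts[w] >= min_count else RARE, t) for w, t in sent]
            for sent in tagged_sents]

def train_most_frequent_tag(tagged_sents):
    """Map each word (including RARE) to the tag it occurs with most often."""
    tag_counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            tag_counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in tag_counts.items()}

def tag(words, model):
    # Words outside the vocabulary are looked up as RARE
    # (the result is None if the training data contained no rare words).
    return [(w, model.get(w, model.get(RARE))) for w in words]

# Hypothetical toy corpus: only "the" reaches the frequency threshold.
train = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("the", "DT"), ("bird", "NN"), ("sings", "VBZ")],
    [("the", "DT"), ("fox", "NN")],
]
model = train_most_frequent_tag(replace_rare(train, min_count=3))
print(tag(["the", "zebra"], model))
```

In a real tagger the same preprocessing would feed an HMM or similar sequence model, so the tag chosen for *RARE* would also depend on the surrounding context rather than on frequency alone.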
An even simpler way is to tag every out-of-vocabulary word with the majority tag. The following code, using the NLTK toolkit, tags every unseen word as 'NN':
import nltk  # trainingCorpus: a list of tagged sentences, [[(word, tag), ...], ...]
tagger = nltk.UnigramTagger(trainingCorpus, backoff=nltk.DefaultTagger('NN'))
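Rather than hard-coding 'NN', the majority tag can be computed from the training data. A small sketch, assuming the same list-of-tagged-sentences format; `majority_tag` and `toy_corpus` are hypothetical names used only for illustration:

```python
from collections import Counter

def majority_tag(tagged_sents):
    """Return the tag that occurs most often in the training data."""
    counts = Counter(tag for sent in tagged_sents for _, tag in sent)
    return counts.most_common(1)[0][0]

# Hypothetical toy corpus; a real one would come from a tagged treebank.
toy_corpus = [
    [("the", "DT"), ("dog", "NN"), ("chased", "VBD"), ("the", "DT"), ("cat", "NN")],
    [("a", "DT"), ("bird", "NN"), ("sang", "VBD")],
    [("dogs", "NNS"), ("like", "VBP"), ("food", "NN"), ("and", "CC"), ("water", "NN")],
]
default_tag = majority_tag(toy_corpus)
print(default_tag)
```

The resulting string can then be passed to nltk.DefaultTagger in place of the literal 'NN'.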
Upvotes: 4