S Gaber

Reputation: 1560

Part of speech for unknown and known words

What is the difference between part-of-speech tagging for unknown words and part-of-speech tagging for known words? Is there any tool that can predict part-of-speech tags for such words?

Upvotes: 0

Views: 1763

Answers (2)

NQD

Reputation: 470

The TnT tagger paper presents an efficient approach to tagging unknown words.

Another approach, which uses a lexicon to handle unknown words, can be found in this article. The article shows that the lexicon-based approach obtains promising results on unknown words compared with TnT on 13 languages: Bulgarian, Czech, Dutch, English, French, German, Hindi, Italian, Portuguese, Spanish, Swedish, Thai and Vietnamese. The article also reports accuracy results (for known and unknown words separately) for TnT and two other POS and morphological taggers on the same 13 languages.

Upvotes: 0

chenaren

Reputation: 2258

One common way of handling out-of-vocabulary words is to replace every low-frequency word in the training corpus (e.g., frequency < 3) with the token *RARE*, so that the tagger learns roughly how rare words should be tagged. Then, at test time, treat every word that is not in the tagger's vocabulary as *RARE*.
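
As a rough illustration of the *RARE* replacement described above, here is a minimal pure-Python sketch. The helper names (`replace_rare`, `map_unknown`), the `min_count` parameter and the toy corpus are assumptions made for this example and do not come from any particular tagger; the training data is assumed to be a list of sentences, each a list of (word, tag) pairs.

```python
from collections import Counter

RARE = "*RARE*"  # placeholder token for low-frequency and unseen words


def replace_rare(tagged_sents, min_count=3):
    """Replace words seen fewer than min_count times with RARE.

    Returns the rewritten sentences and the vocabulary of kept words.
    """
    counts = Counter(word for sent in tagged_sents for word, _ in sent)
    vocab = {word for word, count in counts.items() if count >= min_count}
    new_sents = [
        [(word if word in vocab else RARE, tag) for word, tag in sent]
        for sent in tagged_sents
    ]
    return new_sents, vocab


def map_unknown(words, vocab):
    """At test time, map every out-of-vocabulary word to RARE."""
    return [word if word in vocab else RARE for word in words]


# Toy corpus (hypothetical): "cat" occurs once, so it becomes *RARE*.
train = [
    [("the", "DT"), ("dog", "NN")],
    [("the", "DT"), ("cat", "NN")],
    [("the", "DT"), ("dog", "NN")],
    [("the", "DT"), ("dog", "NN")],
]
rare_train, vocab = replace_rare(train, min_count=3)
test_words = map_unknown(["the", "zebra"], vocab)
```

The tagger would then be trained on `rare_train` and applied to `test_words`, so an unseen word such as "zebra" is tagged the way rare training words were.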

An even simpler way is to tag every out-of-vocabulary word with the majority tag. The following code, which uses the NLTK toolkit, tags every unseen word as 'NN'.

import nltk
tagger = nltk.UnigramTagger(trainingCorpus, backoff=nltk.DefaultTagger('NN'))
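
To make the majority-tag idea concrete without hard-coding 'NN', the following is a small pure-Python sketch that computes the most frequent tag from the training data and uses it as the fallback. It only mimics the idea of a unigram tagger with a default backoff; it is not how NLTK implements it, and the names `train_unigram` and `tag_words` as well as the toy corpus are assumptions for illustration.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Learn each word's most frequent tag and the corpus-wide majority tag."""
    per_word = defaultdict(Counter)
    all_tags = Counter()
    for sent in tagged_sents:
        for word, tag in sent:
            per_word[word][tag] += 1
            all_tags[tag] += 1
    model = {word: tags.most_common(1)[0][0] for word, tags in per_word.items()}
    majority_tag = all_tags.most_common(1)[0][0]
    return model, majority_tag


def tag_words(words, model, majority_tag):
    """Tag known words from the model; unseen words get the majority tag."""
    return [(word, model.get(word, majority_tag)) for word in words]


# Toy corpus (hypothetical): NN is the most frequent tag overall.
train = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("a", "DT"), ("cat", "NN"), ("and", "CC"), ("bird", "NN")],
]
model, majority_tag = train_unigram(train)
tagged = tag_words(["the", "zebra"], model, majority_tag)
```

Here "zebra" never appears in training, so it receives whichever tag was most frequent in the corpus.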

Upvotes: 4
