Jatin Khurana

Reputation: 1175

Unknown word handling in a part-of-speech tagger

What is the correct way to handle unknown words?

I am confused about the order of the checks: should I first check whether the word starts with a capital letter, or first check its suffix?

Should I learn from the corpus how often capitalized words are nouns, or blindly assign them the noun tag?

Which approach would be better?

Upvotes: 0

Views: 1013

Answers (3)

NQD

Reputation: 470

This paper presents a simple lexicon-based approach for tagging unknown words. It shows that the lexicon-based approach obtains promising tagging results for unknown words in 13 languages: Bulgarian, Czech, Dutch, English, French, German, Hindi, Italian, Portuguese, Spanish, Swedish, Thai and Vietnamese.

In addition, the paper reports accuracy results (for known words and unknown words) of 3 POS and morphological taggers on the 13 languages.

Upvotes: 0

tripleee

Reputation: 189317

Your question is probably too broad to answer properly, but given your level of abstraction, here are a few things to consider when working out what "it depends" on.

Capitalization is not a good universal strategy because different languages have different capitalization norms. In German, every properly spelled Noun is written with a Capital Letter, whereas some languages do not distinguish between upper and lower case at all (and some scripts lack this distinction -- Arabic, Hebrew, Thai, Devanagari, not to mention Far Eastern scripts which of course are a completely different challenge altogether).

In English, obviously, capitalization is a good indicator that you are probably looking at a proper noun, but the absence of capitalization does not help you decide the correct POS at all.

Suffix matching is one of many possible strategies for deciding the POS of an unknown word. Your choice of wording -- "the suffix" -- implies you have a very simplistic understanding of word formation. Some languages have suffix derivation and inflection, but there are many other patterns. Swahili inflection uses prefixes, Arabic and Hebrew use infixes (which are, however, not marked orthographically), some languages mark plural through reduplication, etc.
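For English, the capitalization and suffix cues can both be learned from a tagged corpus instead of being assigned blindly. The following is only a minimal sketch under stated assumptions, not a definitive method: the function names `train_guesser` and `guess_tag`, the tag labels, the maximum suffix length of 3, and the order of the checks (capitalization away from the sentence start first, then the longest known suffix, then the most frequent tag overall) are all illustrative choices, not something prescribed by any particular tagger.

```python
from collections import Counter, defaultdict


def train_guesser(tagged_sentences, max_suffix=3):
    """Collect tag counts per word suffix and for capitalized words.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    """
    suffix_tags = defaultdict(Counter)  # suffix -> tag counts
    cap_tags = Counter()                # tags of capitalized, non-initial words
    all_tags = Counter()                # overall tag counts (last-resort fallback)
    for sentence in tagged_sentences:
        for position, (word, tag) in enumerate(sentence):
            all_tags[tag] += 1
            # Record suffixes of length 1..max_suffix, always shorter than the word.
            for n in range(1, min(max_suffix, len(word) - 1) + 1):
                suffix_tags[word[-n:].lower()][tag] += 1
            # Sentence-initial capitals carry little information, so skip them.
            if position > 0 and word[:1].isupper():
                cap_tags[tag] += 1
    return suffix_tags, cap_tags, all_tags


def guess_tag(word, position, model, max_suffix=3):
    """Guess a tag for an unknown word (assumes a non-empty training corpus)."""
    suffix_tags, cap_tags, all_tags = model
    # Assumed ordering: capitalization inside a sentence is checked first.
    if position > 0 and word[:1].isupper() and cap_tags:
        return cap_tags.most_common(1)[0][0]
    # Otherwise back off from the longest suffix to the shortest.
    for n in range(min(max_suffix, len(word) - 1), 0, -1):
        counts = suffix_tags.get(word[-n:].lower())
        if counts:
            return counts.most_common(1)[0][0]
    # Nothing matched: fall back to the most frequent tag in the corpus.
    return all_tags.most_common(1)[0][0]
```

Because the counts come from the corpus, the same code makes no assumption that a capitalized word is a noun; in a language with different capitalization norms the learned counts would simply differ, and in a script without case the capitalization branch would never fire.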

Though it's no longer state of the art, a look at the Brill tagger is probably a good start for a better understanding of possible strategies.

A competing approach is to use syntactic constraints to disambiguate the role of each word. An application of constraint grammar is to use the POS tags of surrounding words to decide the most likely reading of an ambiguous or unknown word.
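As a rough illustration of using surrounding tags, the sketch below counts which tag occurs between each pair of neighbouring tags in a tagged corpus and uses the most frequent one for an unknown word. This is a simple statistical stand-in, not an implementation of constraint grammar (which relies on hand-written rules), and the names `train_context_model` and `guess_from_context`, the boundary markers, and the default tag are assumptions made for the example.

```python
from collections import Counter, defaultdict


def train_context_model(tagged_sentences):
    """Count the tags seen between each (previous tag, next tag) pair."""
    context_tags = defaultdict(Counter)
    for sentence in tagged_sentences:
        # Pad with hypothetical boundary markers so edge words have two neighbours.
        tags = ["<s>"] + [tag for _, tag in sentence] + ["</s>"]
        for i in range(1, len(tags) - 1):
            context_tags[(tags[i - 1], tags[i + 1])][tags[i]] += 1
    return context_tags


def guess_from_context(prev_tag, next_tag, context_tags, default="NOUN"):
    """Return the most frequent tag for this context, or the default if unseen."""
    counts = context_tags.get((prev_tag, next_tag))
    if not counts:
        return default
    return counts.most_common(1)[0][0]
```

In practice the neighbouring tags are themselves uncertain, so a real tagger would combine this kind of evidence with word-internal cues rather than rely on it alone.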

Upvotes: 2

langkilde

Reputation: 1513

Are you trying to write your own POS tagger?

If not, I suggest you use the Stanford POS tagger or some other open-source software. It will attempt to assign the correct POS tag to each word in a sentence. You can download it here:

http://nlp.stanford.edu/software/tagger.shtml

Upvotes: 0
