giorgio79

Reputation: 4209

How to programmatically determine the part-of-speech tag of a word?

I've been wondering how to determine the POS tag of a word accurately. I have played with POS taggers such as Stanford NLP, but they are hit and miss: a word like "respond" is sometimes tagged as NN (noun) when it is a verb (VB).

Would querying WordNet, or a dictionary dump, be more accurate? E.g. the word "respond" is a verb, and can also be a noun. Or perhaps infer from n-grams, or add a frequency-based sanity check?

Upvotes: 2

Views: 1426

Answers (3)

nmlq

Reputation: 3154

A POS tagger is traditionally based on a probability distribution of words over a corpus. Applying it to a new body of text will therefore usually yield higher error rates, since the distribution of words is different.

Other models, such as neural networks, are not strictly probability distributions and need to be trained, but the same logic holds for them too.

For example, if I build a POS tagger for Shakespeare texts by using tagged sentences from Hamlet to define my probability distribution, then try to POS tag biomedical texts, it probably won't perform well.

Therefore, to increase accuracy, you should train on a body of text that is similar to your specific domain.
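To make the "distribution over a corpus" point concrete, here is a minimal unigram (most-frequent-tag) baseline; the tiny tagged corpus is invented for illustration, and a real tagger adds context on top of exactly this kind of per-word statistic:

```python
from collections import Counter, defaultdict

# toy tagged corpus standing in for domain-specific training data
tagged_words = [("respond", "VERB"), ("respond", "VERB"), ("respond", "NOUN"),
                ("the", "DET"), ("request", "NOUN")]

# count how often each word carries each tag
tag_counts = defaultdict(Counter)
for word, tag in tagged_words:
    tag_counts[word][tag] += 1

def most_frequent_tag(word):
    # pick the tag this word carries most often in the training data
    return tag_counts[word].most_common(1)[0][0]

print(most_frequent_tag("respond"))  # VERB on this toy corpus
```

If your training corpus mostly uses "respond" as a verb, the baseline tags it as a verb; train on a different domain and the answer can change, which is the whole point of retraining.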

The best-performing POS tagger in NLTK is the perceptron tagger, which is the default and uses a pre-trained model. Here is how you would train your own model to increase accuracy:

import math
import nltk

# get tagged data to train and test on (Brown corpus, universal tagset)
tagged_sentences = list(nltk.corpus.brown.tagged_sents(categories='news', tagset='universal'))
# hold out 20% for testing; get the index of the 20% split
split_idx = math.floor(len(tagged_sentences) * 0.2)
# testing sentences are words only, list(list(word))
testing_sentences = [[word for word, _ in sent] for sent in tagged_sentences[:split_idx]]
# training sentences are words and tags, list(list((word, tag)))
training_sentences = tagged_sentences[split_idx:]
# create an untrained instance of the perceptron POS tagger and train it
perceptron_tagger = nltk.tag.perceptron.PerceptronTagger(load=False)
perceptron_tagger.train(training_sentences)
# tag each held-out sentence
pos_tagged_sentences = [perceptron_tagger.tag(sentence) for sentence in testing_sentences]

After perceptron_tagger.train() finishes on training_sentences, you can use perceptron_tagger.tag() to produce pos_tagged_sentences that fit your domain better and give much higher accuracy.

Done right, this produces high-accuracy results. My basic tests show the following:

Metrics for <nltk.tag.perceptron.PerceptronTagger object at 0x7f34904d1748>
 Accuracy : 0.965636914654
 Precision: 0.965271747376
 Recall   : 0.965636914654
 F1-Score : 0.965368188021
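A per-token accuracy like the one above can be computed by comparing predicted tags against the held-out gold tags; this is a sketch, where `gold_sents` stands for the original tagged test sentences (which you would keep aside before stripping the tags):

```python
def tag_accuracy(gold_sents, predicted_sents):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for gold, pred in zip(gold_sents, predicted_sents):
        for (_, gold_tag), (_, pred_tag) in zip(gold, pred):
            total += 1
            correct += (gold_tag == pred_tag)
    return correct / total

# toy check with two short "sentences": 3 of 4 tags match
gold = [[("I", "PRON"), ("respond", "VERB")], [("the", "DET"), ("respond", "NOUN")]]
pred = [[("I", "PRON"), ("respond", "VERB")], [("the", "DET"), ("respond", "VERB")]]
print(tag_accuracy(gold, pred))  # 0.75
```

Precision, recall, and F1 per tag can be computed the same way by counting matches for each tag separately.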

Upvotes: 4

alexis

Reputation: 50220

POS tagging is a surprisingly hard problem, considering how easy it seems when a human does it. POS taggers have been written using many different approaches, and the Stanford tagger is among the best general-purpose taggers for English. (See here for a pretty authoritative comparison.) So if the approaches you suggest are any good (and some of them are), they are already in use.

If you think you can build a better tagger, by all means give it a go; it will be a great learning experience. But don't be surprised if you can't beat a state of the art POS tagger at what it does.

Upvotes: 2

Violapterin

Reputation: 347

Have you tried TextBlob? A friend of mine took a linguistics course, and everyone in it used it to mark POS. It is a Python library; you can install it with the package manager pip:

$ pip install -U textblob

Then, to use it:

>>> from textblob import TextBlob

There is a more detailed tutorial online, and you may also need to install the NLTK corpora it relies on. (I can't post links, but just search; tutorials exist in abundance.)

Upvotes: -1
