Meaning of Stanford Spanish POS Tagger tags

Question

I am tagging Spanish text with the Stanford POS Tagger (via NLTK in Python).

Here is my code:

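import nltk
# Older NLTK releases expose this wrapper as POSTagger; later releases renamed it StanfordPOSTagger.
from nltk.tag.stanford import POSTagger
# Arguments: path to the Spanish model file, then path to the tagger JAR.
spanish_postagger = POSTagger('models/spanish.tagger', 'stanford-postagger.jar')
# tag() expects a list of tokens, so split the sentence on whitespace first.
spanish_postagger.tag('esta es una oracion de prueba'.split())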

The result is:

[(u'esta', u'pd000000'),
(u'es', u'vsip000'),
(u'una', u'di0000'),
(u'oracion', u'nc0s000'),
(u'de', u'sp000'),
(u'prueba', u'nc0s000')]

Where can I find out exactly what pd000000, vsip000, di0000, nc0s000, and sp000 mean?

Jon Gauthier · Accepted Answer

This is a simplified version of the tagset used in the AnCora treebank. You can find their tagset documentation here: https://web.archive.org/web/20160325024315/http://nlp.lsi.upc.edu/freeling/doc/tagsets/tagset-es.html

The "simplification" consists of nulling out many of the final fields which don't strictly belong in a part-of-speech tag. For example, our part-of-speech tagger will always give you null (0) values for the NER field of the original tagset (see EAGLES noun documentation).

In short: the fields in the POS tags produced by our tagger correspond exactly to AnCora POS fields, but a lot of those fields will be null. For most practical purposes you'll only need to look at the first 2–4 characters of the tag. The first character always indicates the broad POS category, and the second character indicates some kind of subtype.
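To illustrate reading only the first two characters, here is a minimal sketch that decodes the tags from the question. The `describe_tag` helper and both lookup tables are hypothetical, not part of the tagger or NLTK. The labels reflect my reading of the EAGLES-style tagset documentation linked above, and the subtype table covers only the five tags in the question, so check the documentation before relying on it.

```python
# Hypothetical lookup tables, based on one reading of the EAGLES-style
# tagset documentation; verify the labels against that documentation.
CATEGORY = {
    "a": "adjective",
    "c": "conjunction",
    "d": "determiner",
    "f": "punctuation",
    "i": "interjection",
    "n": "noun",
    "p": "pronoun",
    "r": "adverb",
    "s": "adposition",
    "v": "verb",
    "w": "date",
    "z": "numeral",
}

# Only the (category, subtype) pairs that occur in the question's output.
SUBTYPE = {
    ("p", "d"): "demonstrative",
    ("v", "s"): "semiauxiliary",
    ("d", "i"): "indefinite",
    ("n", "c"): "common",
    ("s", "p"): "preposition",
}


def describe_tag(tag):
    """Return (category, subtype) labels for the first two characters of a tag."""
    tag = tag.lower()
    first, second = tag[:1], tag[1:2]
    category = CATEGORY.get(first, "unknown")
    subtype = SUBTYPE.get((first, second), "unknown")
    return category, subtype


if __name__ == "__main__":
    tagged = [
        ("esta", "pd000000"),
        ("es", "vsip000"),
        ("una", "di0000"),
        ("oracion", "nc0s000"),
        ("de", "sp000"),
        ("prueba", "nc0s000"),
    ]
    for word, tag in tagged:
        print(word, tag, describe_tag(tag))
```

Anything not listed in the tables comes back as "unknown" rather than a guess, which keeps gaps in the tables visible.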

We're in the process of writing some introductory documentation for using Spanish with CoreNLP (that means understanding these tags, and much else) right now. For the moment, you can find more information on the first page of our technical documentation.
