Reputation: 285
I am tagging Spanish text with the Stanford POS Tagger (via NLTK in Python).
Here is my code:
import nltk
from nltk.tag.stanford import POSTagger
spanish_postagger = POSTagger('models/spanish.tagger', 'stanford-postagger.jar')
spanish_postagger.tag('esta es una oracion de prueba'.split())
The result is:
[(u'esta', u'pd000000'),
(u'es', u'vsip000'),
(u'una', u'di0000'),
(u'oracion', u'nc0s000'),
(u'de', u'sp000'),
(u'prueba', u'nc0s000')]
Where can I find out what exactly the tags pd000000, vsip000, di0000, nc0s000, and sp000 mean?
Upvotes: 6
Views: 3790
Reputation: 25592
This is a simplified version of the tagset used in the AnCora treebank. You can find their tagset documentation here: https://web.archive.org/web/20160325024315/http://nlp.lsi.upc.edu/freeling/doc/tagsets/tagset-es.html
The "simplification" consists of nulling out many of the final fields which don't strictly belong in a part-of-speech tag. For example, our part-of-speech tagger will always give you null (0) values for the NER field of the original tagset (see EAGLES noun documentation).
In short: the fields in the POS tags produced by our tagger correspond exactly to AnCora POS fields, but a lot of those fields will be null. For most practical purposes you'll only need to look at the first 2–4 characters of the tag. The first character always indicates the broad POS category, and the second character indicates some kind of subtype.
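To illustrate reading only the leading characters, here is a minimal Python sketch that splits a tag into its broad category, its subtype character, and the remaining fields. The names `COARSE_CATEGORIES` and `split_tag` are hypothetical helpers, not part of NLTK or the tagger, and the category labels are an assumption based on the EAGLES-style convention in the linked tagset documentation, so check them against that page before relying on them.

```python
# Hypothetical lookup table keyed by the first character of a tag.
# Assumption: labels follow the EAGLES-style convention from the linked
# tagset documentation; verify against that page.
COARSE_CATEGORIES = {
    "a": "adjective",
    "c": "conjunction",
    "d": "determiner",
    "f": "punctuation",
    "i": "interjection",
    "n": "noun",
    "p": "pronoun",
    "r": "adverb",
    "s": "adposition",
    "v": "verb",
    "w": "date",
    "z": "numeral",
}


def split_tag(tag):
    """Split a tag into (broad category, subtype character, remaining fields).

    The subtype is None when the tag has no second character or when that
    field is nulled out with '0'.
    """
    tag = tag.lower()
    category = COARSE_CATEGORIES.get(tag[:1], "unknown")
    subtype = tag[1:2] or None
    if subtype == "0":
        subtype = None
    return category, subtype, tag[2:]


# The tagger output from the question, without the u'' prefixes.
tagged = [
    ("esta", "pd000000"),
    ("es", "vsip000"),
    ("una", "di0000"),
    ("oracion", "nc0s000"),
    ("de", "sp000"),
    ("prueba", "nc0s000"),
]

for word, tag in tagged:
    print(word, split_tag(tag))
```

The subtype character is returned raw on purpose: its meaning depends on the category (for example, the second character of a noun tag and of a verb tag encode different things), so it is best looked up per category in the tagset documentation.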
We're in the process of writing some introductory documentation for using Spanish with CoreNLP (that means understanding these tags, and much else) right now. For the moment, you can find more information on the first page of our technical documentation.
Upvotes: 10