Reputation: 795
I'm using Stanford NLP to do POS tagging for Spanish texts. I can get a POS tag for each word, but I notice that I am only given the first four sections of the AnCora tag; the last three sections, for person, number, and gender, are missing.
Why does Stanford NLP only use a reduced version of the Ancora tag?
Is it possible to get the entire tag using Stanford NLP?
Here is my code (please excuse the JRuby):
require 'java'
# Assumes the CoreNLP and Spanish model jars are already on the classpath.
java_import 'edu.stanford.nlp.pipeline.StanfordCoreNLP'
java_import 'edu.stanford.nlp.pipeline.Annotation'
props = java.util.Properties.new()
props.put("tokenize.language", "es")
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse")
props.put("ner.model", "edu/stanford/nlp/models/ner/spanish.ancora.distsim.s512.crf.ser.gz")
props.put("pos.model", "/stanford-postagger-full-2015-01-30/models/spanish-distsim.tagger")
props.put("parse.model", "edu/stanford/nlp/models/lexparser/spanishPCFG.ser.gz")
pipeline = StanfordCoreNLP.new(props)
annotation = Annotation.new("No sé qué estoy haciendo. Me pregunto si esto va a funcionar.")
# Run the configured annotators over the text.
pipeline.annotate(annotation)
I am getting this as the output:
[Text=No CharacterOffsetBegin=0 CharacterOffsetEnd=2 PartOfSpeech=rn Lemma=no NamedEntityTag=O]
[Text=sé CharacterOffsetBegin=3 CharacterOffsetEnd=5 PartOfSpeech=vmip000 Lemma=sé NamedEntityTag=O]
[Text=qué CharacterOffsetBegin=6 CharacterOffsetEnd=9 PartOfSpeech=pt000000 Lemma=qué NamedEntityTag=O]
[Text=estoy CharacterOffsetBegin=10 CharacterOffsetEnd=15 PartOfSpeech=vmip000 Lemma=estoy NamedEntityTag=O]
[Text=haciendo CharacterOffsetBegin=16 CharacterOffsetEnd=24 PartOfSpeech=vmg0000 Lemma=haciendo NamedEntityTag=O]
[Text=. CharacterOffsetBegin=24 CharacterOffsetEnd=25 PartOfSpeech=fp Lemma=. NamedEntityTag=O]
(I also notice that the lemmas are incorrect, but that's probably an issue for a separate question. Never mind: I see that Stanford NLP does not support Spanish lemmatization.)
Upvotes: 1
Views: 1662
Reputation: 470
If you are not restricted to the Stanford POS tagger, you might want to try RDRPOSTagger, a toolkit for POS and morphological tagging. RDRPOSTagger provides pre-trained POS and morphological tagging models for 40 different languages, including Spanish.
For Spanish POS and morphological tagging, RDRPOSTagger was trained on the IULA Spanish LSP Treebank. It obtained a tagging accuracy of 97.95% at a tagging speed of 200K words/second in the Java implementation (10K words/second in the Python implementation), measured on a 64-bit Windows 7 machine with a 2.50GHz Core i5 CPU and 6GB of memory.
Upvotes: 0
Reputation: 25592
Why does Stanford NLP only use a reduced version of the Ancora tag?
This was a practical decision made to ensure high tagging accuracy. (Retaining morphological information on the tags caused the entire tagger to suffer from data sparsity and to do worse across the board, not only on morphological annotation.)
Is it possible to get the entire tag using Stanford NLP?
No. You could get quite far doing this with a simple rule-based system, though, or use the Stanford Classifier to train your own morphological annotator. (Feel free to share your code if you pick either path!)
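As a rough illustration of the rule-based route, here is a minimal sketch in plain Ruby that fills in the person and number slots of a reduced present-indicative verb tag such as vmip000. Everything in it is an assumption made for illustration: the function name fill_person_number, the tiny exception list, and the suffix rules are hypothetical and are not part of Stanford NLP. It also assumes the usual AnCora layout in which characters 5 and 6 of a verb tag hold person and number. A usable annotator would need far more rules and exceptions.

```ruby
# Hypothetical exception list for forms the suffix rules below would get wrong.
IRREGULAR_PERSON_NUMBER = {
  "sé" => "1s",  # saber, first person singular
  "he" => "1s",  # haber, first person singular
  "es" => "3s"   # ser, third person singular (would otherwise look like 2s)
}.freeze

# Hypothetical suffix rules for the present indicative, tried in order.
SUFFIX_RULES = [
  [/mos\z/,      "1p"],  # hablamos
  [/[áéí]is\z/,  "2p"],  # habláis
  [/s\z/,        "2s"],  # hablas
  [/n\z/,        "3p"],  # hablan
  [/(o|oy)\z/,   "1s"]   # hablo, estoy
].freeze

# Returns the tag with person and number filled in when it is a reduced
# present-indicative verb tag; any other tag is returned unchanged.
def fill_person_number(word, tag)
  return tag unless tag =~ /\Av[mas]ip000\z/

  form = word.downcase
  person_number = IRREGULAR_PERSON_NUMBER[form]
  if person_number.nil?
    rule = SUFFIX_RULES.find { |pattern, _| form =~ pattern }
    # Fall back to third person singular when no rule matches.
    person_number = rule ? rule[1] : "3s"
  end

  # Keep the first four characters and the final gender slot as they are.
  tag[0, 4] + person_number + tag[6..-1]
end

puts fill_person_number("estoy", "vmip000")    # prints "vmip1s0"
puts fill_person_number("haciendo", "vmg0000") # prints "vmg0000" (not handled)
```

The exception table is checked before the suffix rules so that short irregular forms are not misread by a generic ending.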
Upvotes: 1