Reputation: 9953
I'm using the Stanford Part of Speech tagger on some Spanish text. As per their docs the part of speech tags come from this set: http://nlp.stanford.edu/software/spanish-faq.shtml#tagset
Overall, I've found this to be accurate and haven't had an issue. However, I just ran into a small snippet of text: "Adiós ~ hailey". This is tagged as follows: Adiós_i ~_word hailey_aq0000. So the "~" symbol, which I think should get a punctuation tag of "f0", got a tag of "word". That isn't documented or expected. Is this a bug or expected?
It turns out the special "word" tag appears in other contexts as well. I just saw it for the word "it" and the word "á".
Upvotes: 1
Views: 136
Reputation: 25572
Thanks for catching this! I've been a bit slow to catch up on documentation. I just updated the tag list in our documentation to include the new "word" tag.
In the CoreNLP 3.7.0 release, we included new Spanish models trained on extra data (specifically, the DEFT Spanish Treebank V2). Some of the new data comes from a discussion forum dataset (Latin American Spanish Discussion Forum Treebank). This dataset uses an extra POS tag, "word", to label emoticons and miscellaneous symbols (e.g. the ® sign).
(I know, it's a sort of silly choice of name, but we wanted to stick with what the original corpus used.)
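If downstream code needs to treat these tokens specially, one option is to post-process the tagger output. The following is only a rough sketch, not part of CoreNLP: it assumes the underscore-separated token_TAG format shown in the question, and the helper names split_tagged and misc_symbols are made up for illustration.

```python
def split_tagged(text, sep="_"):
    """Split 'token_TAG token_TAG ...' into (token, tag) pairs.

    Assumes the tag follows the LAST separator in each item, so tokens
    that themselves contain the separator are kept intact.
    """
    pairs = []
    for item in text.split():
        token, found, tag = item.rpartition(sep)
        if not found:
            # No separator at all: keep the item as an untagged token.
            pairs.append((item, None))
        else:
            pairs.append((token, tag))
    return pairs


def misc_symbols(pairs, misc_tag="word"):
    """Return the tokens labeled with the emoticon/misc-symbol tag."""
    return [token for token, tag in pairs if tag == misc_tag]


# The tagged output quoted in the question.
tagged = "Adiós_i ~_word hailey_aq0000"
pairs = split_tagged(tagged)
symbols = misc_symbols(pairs)
print(pairs)
print(symbols)
```

From here the flagged tokens could be dropped, or remapped to whatever tag the rest of a pipeline expects; which of those is appropriate depends on the application.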
Upvotes: 1