Oliver Dain

Reputation: 9953

Part of speech tagged as "word"

I'm using the Stanford Part of Speech tagger on some Spanish text. According to their docs, the part-of-speech tags come from this set: http://nlp.stanford.edu/software/spanish-faq.shtml#tagset

Overall, I've found the tagger accurate and haven't had any issues. However, I just ran into a small snippet of text: "Adiós ~ hailey". It is tagged as follows: Adiós_i ~_word hailey_aq0000. So the ~ symbol, which I think should get the punctuation tag f0, got the tag word instead. That tag isn't documented. Is this a bug or expected behavior?

Update

It turns out the special "word" tag appears in other contexts as well. I just saw it assigned to the word "it" and the word "á".

Upvotes: 1

Views: 136

Answers (1)

Jon Gauthier

Reputation: 25572

Thanks for catching this! I've been a bit slow to catch up on documentation. I just updated the tag list in our documentation to include the new "word" tag.

In the CoreNLP 3.7.0 release, we included new Spanish models trained on extra data (specifically, the DEFT Spanish Treebank V2). Some of the new data comes from a discussion forum dataset (Latin American Spanish Discussion Forum Treebank). This dataset uses an extra POS tag, word, to label emoticons and miscellaneous symbols (e.g. the ® sign).

(I know, it's a sort of silly choice of name — but we wanted to stick with what the original corpus used.)
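If you need to treat these tokens specially downstream, here is a minimal Python sketch. It is not part of CoreNLP: the helper names split_tagged and symbol_tokens are hypothetical, and it assumes the token_tag output format shown in the question, with "_" as the separator.

```python
def split_tagged(text, sep="_"):
    """Split whitespace-separated 'token_tag' items into (token, tag) pairs.

    Assumes the tag follows the LAST separator, so tokens that themselves
    contain the separator are still split correctly.
    """
    pairs = []
    for item in text.split():
        token, found, tag = item.rpartition(sep)
        if not found:
            # No separator present: keep the raw item and mark the tag as missing.
            token, tag = item, None
        pairs.append((token, tag))
    return pairs


def symbol_tokens(pairs, symbol_tag="word"):
    """Return the tokens carrying the emoticon/miscellaneous-symbol tag."""
    return [token for token, tag in pairs if tag == symbol_tag]


# Example from the question.
pairs = split_tagged("Adiós_i ~_word hailey_aq0000")
print(pairs)
print(symbol_tokens(pairs))
```

With something like this you could, for instance, drop or remap the "word" tokens before passing the tags to code that only knows the older tag set.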

Upvotes: 1
