What's the difference between the 'originalText' and 'word' keys in a token?

Question

When using CoreNLPParser from NLTK with CoreNLP Server, the resulting tokens contain both an 'originalText' key and a 'word' key.

What's the difference between the two? Is there any documentation about them?

I've only found this issue, which mentions the originalText key, but it doesn't answer my question.

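from nltk.parse.corenlp import CoreNLPParser

# Assumes a CoreNLP server is already running on localhost:9000
corenlp_parser = CoreNLPParser('http://localhost:9000', encoding='utf8')
text = u'我家没有电脑。'

# Request only tokenization and sentence splitting
result = corenlp_parser.api_call(text, {'annotators': 'tokenize,ssplit'})
print(result)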

prints

{
   "sentences":[
      {
         "index":0,
         "tokens":[
            {
               "index":1,
               "word":"我家",
               "originalText":"我家",
               "characterOffsetBegin":0,
               "characterOffsetEnd":2
            },
            {
               "index":2,
               "word":"没有",
               "originalText":"没有",
               "characterOffsetBegin":2,
               "characterOffsetEnd":4
            },
            {
               "index":3,
               "word":"电脑",
               "originalText":"电脑",
               "characterOffsetBegin":4,
               "characterOffsetEnd":6
            },
            {
               "index":4,
               "word":"。",
               "originalText":"。",
               "characterOffsetBegin":6,
               "characterOffsetEnd":7
            }
         ]
      }
   ]
}

Update:

It seems the Token implements both HasWord and HasOriginalText.

Gabor Angeli · Accepted Answer

The word value is lightly normalized so that, for example, it can be printed inside an S-expression (i.e., a parse tree). Parentheses and other brackets therefore become tokens like -LRB- (left round bracket). In addition, quotes are normalized to backticks (``) and forward ticks (''), along with a few other small changes.

originalText, by contrast, is the literal original text of the token that can be used to reconstruct the original sentence.
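
To make the distinction concrete, here is a minimal sketch of rebuilding a sentence from originalText and the character offsets. The helper name reconstruct and the sample tokens are hypothetical: the tokens are hand-written to mirror the shape of the JSON in the question and the -LRB-/-RRB- normalization described above, not captured from a real server. Whether a given CoreNLP version actually emits -LRB- by default is an assumption here.

```python
def reconstruct(tokens):
    """Rebuild the source string from token dicts (hypothetical helper).

    Uses originalText for the token text and the character offsets to
    decide where whitespace goes. Gaps are filled with plain spaces,
    which is an assumption: the original gap could have been a tab or
    a newline.
    """
    pieces = []
    cursor = 0
    for tok in tokens:
        begin = tok["characterOffsetBegin"]
        pieces.append(" " * (begin - cursor))  # gap before this token
        pieces.append(tok["originalText"])
        cursor = tok["characterOffsetEnd"]
    return "".join(pieces)


# Hand-written sample for the sentence "I (really) agree."
# (illustrative only, not real server output)
sample_tokens = [
    {"word": "I", "originalText": "I",
     "characterOffsetBegin": 0, "characterOffsetEnd": 1},
    {"word": "-LRB-", "originalText": "(",
     "characterOffsetBegin": 2, "characterOffsetEnd": 3},
    {"word": "really", "originalText": "really",
     "characterOffsetBegin": 3, "characterOffsetEnd": 9},
    {"word": "-RRB-", "originalText": ")",
     "characterOffsetBegin": 9, "characterOffsetEnd": 10},
    {"word": "agree", "originalText": "agree",
     "characterOffsetBegin": 11, "characterOffsetEnd": 16},
    {"word": ".", "originalText": ".",
     "characterOffsetBegin": 16, "characterOffsetEnd": 17},
]

# Joining the normalized "word" values loses the original surface form.
normalized = " ".join(tok["word"] for tok in sample_tokens)

# Using originalText plus offsets recovers it.
restored = reconstruct(sample_tokens)

# Tokens whose normalized form differs from the literal text.
changed = [tok["originalText"] for tok in sample_tokens
           if tok["word"] != tok["originalText"]]

print(normalized)
print(restored)
print(changed)
```

For the Chinese example in the question, word and originalText are identical for every token because nothing in that sentence needs normalizing, so both approaches would give the same characters.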
