LShi

Reputation: 1502

What's the difference between the 'originalText' and 'word' keys in a token?

When using CoreNLPParser from NLTK with CoreNLP Server, the resulting tokens contain both an 'originalText' key and a 'word' key.

What's the difference between the two? Is there any documentation about them?

I've only found this issue, which mentions the originalText key, but it doesn't answer my question.

from nltk.parse.corenlp import CoreNLPParser 

corenlp_parser = CoreNLPParser('http://localhost:9000', encoding='utf8')
text = u'我家没有电脑。'

result = corenlp_parser.api_call(text, {'annotators': 'tokenize,ssplit'})
print(result)

prints

{
   "sentences":[
      {
         "index":0,
         "tokens":[
            {
               "index":1,
               "word":"我家",
               "originalText":"我家",
               "characterOffsetBegin":0,
               "characterOffsetEnd":2
            },
            {
               "index":2,
               "word":"没有",
               "originalText":"没有",
               "characterOffsetBegin":2,
               "characterOffsetEnd":4
            },
            {
               "index":3,
               "word":"电脑",
               "originalText":"电脑",
               "characterOffsetBegin":4,
               "characterOffsetEnd":6
            },
            {
               "index":4,
               "word":"。",
               "originalText":"。",
               "characterOffsetBegin":6,
               "characterOffsetEnd":7
            }
         ]
      }
   ]
}

Update:

It seems that the token class implements both HasWord and HasOriginalText.

Upvotes: 2

Views: 162

Answers (1)

Gabor Angeli

Reputation: 5749

The word is normalized slightly so that it can, for example, be printed in an S-expression (i.e., a parse tree). Parentheses and other brackets become tokens such as -LRB- (left round bracket). In addition, quotes are normalized to backticks (``) and forward ticks (''), along with a few other small changes.

originalText, by contrast, is the literal original text of the token that can be used to reconstruct the original sentence.
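To make the distinction concrete, here is a minimal sketch that runs without a CoreNLP server. It builds token dictionaries shaped like the JSON in the question and shows how word and originalText can differ. The BRACKET_ESCAPES table and the to_token and reconstruct helpers are hypothetical names made up for this illustration; they only imitate the bracket normalization described above and are not CoreNLP's actual implementation. The reconstruction also assumes that any gap between tokens was plain spaces.

```python
# Illustrative subset of Penn Treebank-style bracket escapes.
# Assumption: this table is for demonstration only, not CoreNLP's real one.
BRACKET_ESCAPES = {
    "(": "-LRB-", ")": "-RRB-",
    "[": "-LSB-", "]": "-RSB-",
    "{": "-LCB-", "}": "-RCB-",
}


def to_token(text, begin, end):
    """Build a token dict shaped like the CoreNLP server JSON output."""
    original = text[begin:end]
    return {
        "word": BRACKET_ESCAPES.get(original, original),  # normalized form
        "originalText": original,                         # literal source text
        "characterOffsetBegin": begin,
        "characterOffsetEnd": end,
    }


def reconstruct(tokens):
    """Rebuild the sentence from originalText and character offsets.

    Assumption: gaps between tokens were spaces in the source text.
    """
    pieces = []
    cursor = 0
    for tok in tokens:
        pieces.append(" " * (tok["characterOffsetBegin"] - cursor))
        pieces.append(tok["originalText"])
        cursor = tok["characterOffsetEnd"]
    return "".join(pieces)


text = "See (this) now."
# Hand-written token spans, standing in for what a tokenizer would produce.
spans = [(0, 3), (4, 5), (5, 9), (9, 10), (11, 14), (14, 15)]
tokens = [to_token(text, begin, end) for begin, end in spans]

for tok in tokens:
    print(tok["word"], "<-", repr(tok["originalText"]))
print(reconstruct(tokens))
```

In the Chinese example from the question, none of the tokens are brackets or quotes, which is why word and originalText are identical for every token there.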

Upvotes: 3
