Reputation: 1502
When using CoreNLPParser from NLTK with a CoreNLP Server, the resulting tokens contain both an 'originalText' key and a 'word' key.
What's the difference between the two? Is there any documentation about them?
I've only found this issue, which mentions the originalText key, but it doesn't answer my questions.
from nltk.parse.corenlp import CoreNLPParser
corenlp_parser = CoreNLPParser('http://localhost:9000', encoding='utf8')
text = u'我家没有电脑。'
result = corenlp_parser.api_call(text, {'annotators': 'tokenize,ssplit'})
print(result)
This prints:
{
"sentences":[
{
"index":0,
"tokens":[
{
"index":1,
"word":"我家",
"originalText":"我家",
"characterOffsetBegin":0,
"characterOffsetEnd":2
},
{
"index":2,
"word":"没有",
"originalText":"没有",
"characterOffsetBegin":2,
"characterOffsetEnd":4
},
{
"index":3,
"word":"电脑",
"originalText":"电脑",
"characterOffsetBegin":4,
"characterOffsetEnd":6
},
{
"index":4,
"word":"。",
"originalText":"。",
"characterOffsetBegin":6,
"characterOffsetEnd":7
}
]
}
]
}
Update:
It seems the Token implements HasWord and HasOriginalText.
Upvotes: 2
Views: 162
Reputation: 5749
The word is a lightly normalized form of the token, transformed so that it can, for example, be printed in an S-expression (i.e., a parse tree). Parentheses and other brackets become tokens such as -LRB- (left round bracket), quotes are normalized to backticks (``) and forward ticks (''), and there are a few other small changes.
originalText, by contrast, is the literal original text of the token, which can be used to reconstruct the original sentence.
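To make the distinction concrete, here is a minimal sketch that works on a result dict shaped like the JSON in the question. The helper names differing_tokens and rebuild_text are hypothetical, and the sample tokens are hand-written to illustrate the -LRB-/-RRB- case rather than copied from real server output; whether your server normalizes brackets at all may depend on the CoreNLP version and tokenizer options. The sketch also assumes that any gap between tokens was a run of plain spaces and that the character offsets line up with Python string indices.

```python
def differing_tokens(result):
    """Return (word, originalText) pairs for tokens where the two fields differ."""
    pairs = []
    for sentence in result.get("sentences", []):
        for token in sentence.get("tokens", []):
            if token["word"] != token["originalText"]:
                pairs.append((token["word"], token["originalText"]))
    return pairs


def rebuild_text(result):
    """Rebuild the input from originalText, padding gaps between tokens with spaces."""
    pieces = []
    position = 0  # end offset of the previous token
    for sentence in result.get("sentences", []):
        for token in sentence.get("tokens", []):
            gap = token["characterOffsetBegin"] - position
            pieces.append(" " * gap)  # assumption: gaps were plain spaces
            pieces.append(token["originalText"])
            position = token["characterOffsetEnd"]
    return "".join(pieces)


# Hand-written sample (an assumption, not real server output) for the input "(ok)".
sample = {
    "sentences": [{
        "index": 0,
        "tokens": [
            {"index": 1, "word": "-LRB-", "originalText": "(",
             "characterOffsetBegin": 0, "characterOffsetEnd": 1},
            {"index": 2, "word": "ok", "originalText": "ok",
             "characterOffsetBegin": 1, "characterOffsetEnd": 3},
            {"index": 3, "word": "-RRB-", "originalText": ")",
             "characterOffsetBegin": 3, "characterOffsetEnd": 4},
        ],
    }]
}

print(differing_tokens(sample))
print(rebuild_text(sample))
```

Passing the result of corenlp_parser.api_call(...) from the question to differing_tokens should show which tokens, if any, were normalized on your setup; for the Chinese example above the list would be empty, since every word equals its originalText.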
Upvotes: 3