henryr

Reputation: 179

Getting character positions in outputs of stanfordNLP in coreference resolution

I'm trying to use StanfordNLP for coreference resolution, as explained here. I'm running the following code (provided here):

from stanfordnlp.server import CoreNLPClient

text = 'Barack was born in Hawaii. His wife Michelle was born in Milan. He says that she is very smart.'
print(f"Input text: {text}")

# set up the client
client = CoreNLPClient(properties={'annotators': 'coref', 'coref.algorithm' : 'statistical'}, timeout=60000, memory='16G')

# submit the request to the server
ann = client.annotate(text)    

mychains = list()
chains = ann.corefChain
for chain in chains:
    mychain = list()
    # Loop through every mention of this chain
    for mention in chain.mention:
        # Get the sentence in which this mention is located, and get the words which are part of this mention
        # (we can have more than one word, for example, a mention can be a pronoun like "he", but also a compound noun like "His wife Michelle")
        words_list = ann.sentence[mention.sentenceIndex].token[mention.beginIndex:mention.endIndex]
        #build a string out of the words of this mention
        ment_word = ' '.join([x.word for x in words_list])
        mychain.append(ment_word)
    mychains.append(mychain)

for chain in mychains:
    print(' <-> '.join(chain))

After installing the library:

pip3 install stanfordnlp

downloading the models,

wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip

and setting the $CORENLP_HOME variable,

os.environ['CORENLP_HOME'] = "path/to/stanford-corenlp-full-2018-10-05"

This code works well for me. However, the output only identifies mentions by token index, not by character position. For example, the output of the code above is:

Barack <-> His <-> He
His wife Michelle <-> she

Printing the variable mention inside the loop gives:

mentionID: 0
mentionType: "PROPER"
number: "SINGULAR"
gender: "MALE"
animacy: "ANIMATE"
beginIndex: 0
endIndex: 1
headIndex: 0
sentenceIndex: 0
position: 1

mentionID: 4
mentionType: "PRONOMINAL"
number: "SINGULAR"
gender: "MALE"
animacy: "ANIMATE"
beginIndex: 0
endIndex: 1
headIndex: 0
sentenceIndex: 1
position: 3

mentionID: 5
mentionType: "PRONOMINAL"
number: "SINGULAR"
gender: "MALE"
animacy: "ANIMATE"
beginIndex: 0
endIndex: 1
headIndex: 0
sentenceIndex: 2
position: 1

mentionID: 3
mentionType: "PROPER"
number: "SINGULAR"
gender: "FEMALE"
animacy: "ANIMATE"
beginIndex: 0
endIndex: 3
headIndex: 2
sentenceIndex: 1
position: 2

mentionID: 6
mentionType: "PRONOMINAL"
number: "SINGULAR"
gender: "FEMALE"
animacy: "ANIMATE"
beginIndex: 3
endIndex: 4
headIndex: 3
sentenceIndex: 2
position: 2

I looked for other attributes. For example, printing ann.mentionsForCoref gives:

mentionType: "PROPER"
number: "SINGULAR"
gender: "MALE"
animacy: "ANIMATE"
person: "UNKNOWN"
startIndex: 0
endIndex: 1
headIndex: 0
headString: "barack"
nerString: "PERSON"
originalRef: 4294967295
goldCorefClusterID: -1
corefClusterID: 5
mentionNum: 0
sentNum: 0
utter: 0
paragraph: 1
isSubject: false
isDirectObject: true
isIndirectObject: false
isPrepositionObject: false
hasTwin: false
generic: false
isSingleton: false
hasBasicDependency: true
hasEnhancedDepenedncy: true
hasContextParseTree: true

Despite the detailed information this attribute provides, it contains nothing about the character positions of the words. I could split the sentences on spaces, but that is not a general solution, and I think there are cases where it would fail. Can anyone help me with this?

Upvotes: 0

Views: 284

Answers (1)

StanfordNLPHelp

Reputation: 8739

Try adding output_format='json' when you build the client. The JSON data should have the character offset info of each token.

There is info here about using the client:

https://stanfordnlp.github.io/stanfordnlp/corenlp_client.html
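As a rough sketch of how the JSON output could be used: assuming the client is built with output_format='json', client.annotate(text) should return a plain dict in which every token carries characterOffsetBegin and characterOffsetEnd, and every coref mention carries sentNum, startIndex and endIndex. In the JSON output these indices are 1-based, with endIndex exclusive, unlike the 0-based protobuf fields in the question. The helper name mention_char_spans and the small sample dict below are made up for illustration; the sample is hand-written to mimic the shape of the server response and is not real server output.

```python
def mention_char_spans(doc):
    """Return, for each coref chain, a list of (text, begin_char, end_char) tuples."""
    chains = []
    for chain in doc.get("corefs", {}).values():
        spans = []
        for mention in chain:
            # JSON output: sentNum and startIndex are 1-based, endIndex is exclusive.
            tokens = doc["sentences"][mention["sentNum"] - 1]["tokens"]
            first = tokens[mention["startIndex"] - 1]
            last = tokens[mention["endIndex"] - 2]
            spans.append((mention["text"],
                          first["characterOffsetBegin"],
                          last["characterOffsetEnd"]))
        chains.append(spans)
    return chains


# Hand-written stand-in for the dict returned by client.annotate(text)
# (illustrative only; a real response contains many more fields).
text = "Barack left. His wife stayed."
sample = {
    "sentences": [
        {"tokens": [
            {"word": "Barack", "characterOffsetBegin": 0, "characterOffsetEnd": 6},
            {"word": "left", "characterOffsetBegin": 7, "characterOffsetEnd": 11},
            {"word": ".", "characterOffsetBegin": 11, "characterOffsetEnd": 12},
        ]},
        {"tokens": [
            {"word": "His", "characterOffsetBegin": 13, "characterOffsetEnd": 16},
            {"word": "wife", "characterOffsetBegin": 17, "characterOffsetEnd": 21},
            {"word": "stayed", "characterOffsetBegin": 22, "characterOffsetEnd": 28},
            {"word": ".", "characterOffsetBegin": 28, "characterOffsetEnd": 29},
        ]},
    ],
    "corefs": {
        "1": [
            {"text": "Barack", "sentNum": 1, "startIndex": 1, "endIndex": 2},
            {"text": "His", "sentNum": 2, "startIndex": 1, "endIndex": 2},
        ],
    },
}

for chain in mention_char_spans(sample):
    print(" <-> ".join(f"{t} [{b}:{e}]" for t, b, e in chain))
```

With a real server, the same function should work on the dict returned by annotate, and text[begin:end] should give back the mention string as long as the offsets refer to the original input text.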

Upvotes: 1
