DanielTheRocketMan

Reputation: 3249

Understanding and using the Stanford NLP coreference resolution tool (in Python 3.7)

I am trying to understand the Stanford NLP coreference resolution tools. This is my code, and it works:

import os

# Tell stanza where the local CoreNLP installation lives
os.environ["CORENLP_HOME"] = "/home/daniel/StanfordCoreNLP/stanford-corenlp-4.0.0"

from stanza.server import CoreNLPClient

text = 'When he came from Brazil, Daniel was fortified with letters from Conan but otherwise did not know a soul except Herbert. Yet this giant man from the Northeast, who had never worn an overcoat or experienced a change of seasons, did not seem surprised by his past.'

# Start a CoreNLP server, run the pipeline through coreference, then shut it down
with CoreNLPClient(annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner', 'parse', 'depparse', 'coref'],
                   properties={'annotators': 'coref', 'coref.algorithm': 'neural'},
                   timeout=30000, memory='16G') as client:

    ann = client.annotate(text)

# ann.corefChain holds one chain per resolved entity; each chain lists
# the mentions that the algorithm decided refer to that entity
chains = ann.corefChain
chain_dict=dict()
for index_chain,chain in enumerate(chains):
    chain_dict[index_chain]={}
    chain_dict[index_chain]['ref']=''
    # copy the protobuf fields of every mention in the chain into a plain dict
    chain_dict[index_chain]['mentions']=[{'mentionID':mention.mentionID,
                                          'mentionType':mention.mentionType,
                                          'number':mention.number,
                                          'gender':mention.gender,
                                          'animacy':mention.animacy,
                                          'beginIndex':mention.beginIndex,
                                          'endIndex':mention.endIndex,
                                          'headIndex':mention.headIndex,
                                          'sentenceIndex':mention.sentenceIndex,
                                          'position':mention.position,
                                          'ref':'',
                                          } for mention in chain.mention ]


for k,v in chain_dict.items():
    print('key',k)
    mentions=v['mentions']
    for mention in mentions:
        # beginIndex/endIndex are token offsets inside the mention's sentence
        words_list = ann.sentence[mention['sentenceIndex']].token[mention['beginIndex']:mention['endIndex']]
        mention['ref']=' '.join(t.word for t in words_list)
        print(mention['ref'])
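
As an aside, the chain-level 'ref' field above is never filled. If I read the protobuf definition (edu.stanford.nlp.pipeline.CoreNLPProtos) correctly, each chain carries a representative index into its mention list, so a sketch like this could label each chain (the 'representative' field name is my reading of the proto file):

    # Sketch: label each chain with its representative mention.
    # Assumes CorefChain exposes a 'representative' index, as in CoreNLPProtos.
    for index_chain, chain in enumerate(ann.corefChain):
        rep = chain.mention[chain.representative]
        tokens = ann.sentence[rep.sentenceIndex].token[rep.beginIndex:rep.endIndex]
        chain_dict[index_chain]['ref'] = ' '.join(t.word for t in tokens)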
    

I tried three algorithms (a sketch of how I switch between them follows this list):

  1. statistical (as in the code above). Results:

        he
        this giant man from the Northeast , who had never worn an overcoat or experienced a change of seasons
        Daniel
        his

  2. neural. Results:

        this giant man from the Northeast , who had never worn an overcoat or experienced a change of seasons ,
        his

  3. deterministic (I got the error below):

     > Starting server with command: java -Xmx16G -cp
     > /home/daniel/StanfordCoreNLP/stanford-corenlp-4.0.0/*
     > edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout
     > 30000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties
     > corenlp_server-9fedd1e9dfb14c9e.props -preload
     > tokenize,ssplit,pos,lemma,ner,parse,depparse,coref Traceback (most
     > recent call last):
     > 
     >   File "<ipython-input-58-0f665f07fd4d>", line 1, in <module>
     >     runfile('/home/daniel/Documentos/Working Papers/Leader traits/Code/20200704 - Modeling
     > Organizing/understanding_coreference.py',
     > wdir='/home/daniel/Documentos/Working Papers/Leader
     > traits/Code/20200704 - Modeling Organizing')
     > 
     >   File
     > "/home/daniel/anaconda3/lib/python3.7/site-packages/spyder_kernels/customize/spydercustomize.py",
     > line 827, in runfile
     >     execfile(filename, namespace)
     > 
     >   File
     > "/home/daniel/anaconda3/lib/python3.7/site-packages/spyder_kernels/customize/spydercustomize.py",
     > line 110, in execfile
     >     exec(compile(f.read(), filename, 'exec'), namespace)
     > 
     >   File "/home/daniel/Documentos/Working Papers/Leader
     > traits/Code/20200704 - Modeling
     > Organizing/understanding_coreference.py", line 21, in <module>
     >     ann = client.annotate(text)
     > 
     >   File
     > "/home/daniel/anaconda3/lib/python3.7/site-packages/stanza/server/client.py",
     > line 470, in annotate
     >     r = self._request(text.encode('utf-8'), request_properties, **kwargs)
     > 
     >   File
     > "/home/daniel/anaconda3/lib/python3.7/site-packages/stanza/server/client.py",
     > line 404, in _request
     >     raise AnnotationException(r.text)
     > 
     > AnnotationException: java.lang.RuntimeException:
     > java.lang.IllegalArgumentException: No enum constant
     > edu.stanford.nlp.coref.CorefProperties.CorefAlgorithmType.DETERMINISTIC
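
For reference, this is roughly how I switch between the algorithms; only the value of the coref.algorithm property changes between runs (a fresh server per run, which is slow but keeps the comparison clean):

    # Compare algorithms by changing only the coref.algorithm property
    for algorithm in ['statistical', 'neural']:
        with CoreNLPClient(annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner',
                                       'parse', 'depparse', 'coref'],
                           properties={'coref.algorithm': algorithm},
                           timeout=30000, memory='16G') as client:
            ann = client.annotate(text)
        print(algorithm, 'found', len(ann.corefChain), 'chains')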
    

Questions:

  1. Why am I getting this error with the deterministic algorithm?

  2. Any code that uses Stanford NLP from Python seems to be much slower than comparable code based on spaCy or NLTK. I know there is no coreference resolution in those libraries, but when I use from nltk.parse.stanford import StanfordDependencyParser for dependency parsing, for instance, it is much faster than this Stanford NLP library. Is there any way to accelerate this CoreNLPClient in Python? (A sketch of what I have in mind follows this list.)

  3. I will use this library to work with long texts. Is it better to feed it smaller pieces or the entire text? Can long texts cause wrong coreference results? (I have seen very strange results from this library on long texts.) Is there an optimal size? (A chunking sketch also follows this list.)

  4. Results:

The results from the statistical algorithm seem better; I expected the best result to come from the neural algorithm. Do you agree? There are 4 valid mentions with the statistical algorithm but only 2 with the neural algorithm.

Am I missing something?
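
On question 2, here is the kind of approach I have in mind (a sketch, assuming my stanza version accepts start_server=False): launch the server once outside Python, so the JVM and the models are loaded a single time, and let the client only connect to it.

    # Start the server once in a terminal, e.g.:
    #   java -Xmx16G -cp "/home/daniel/StanfordCoreNLP/stanford-corenlp-4.0.0/*" \
    #        edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 30000
    from stanza.server import CoreNLPClient

    # start_server=False: connect to the running server instead of spawning one
    client = CoreNLPClient(endpoint='http://localhost:9000', start_server=False,
                           annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner',
                                       'parse', 'depparse', 'coref'])
    ann = client.annotate(text)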
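
And on question 3, the chunking I am considering looks like this (a sketch; splitting on blank lines is just an assumption about my input, and coreference links across chunk boundaries would be lost):

    # Hypothetical chunking: annotate paragraph by paragraph
    def annotate_in_chunks(client, long_text):
        chunks = [p.strip() for p in long_text.split('\n\n') if p.strip()]
        return [client.annotate(chunk) for chunk in chunks]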

Upvotes: 2

Views: 1567

Answers (1)

smyskov

Reputation: 126

  1. You may find the list of supported algorithms in the Java documentation: link. Deterministic is not among them, which is why you get the "No enum constant ... DETERMINISTIC" error (a note on the deterministic system follows the code below).

  2. You might want to start the server once and then just reuse it, something like

    # The slowest part: the server starts and the models are loaded
    client = CoreNLPClient(...)

    # subsequent calls reuse the already-running server
    ann = client.annotate(text)

    ...

    # shut the server down when you are done
    client.stop()
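
Regarding the deterministic system specifically: if I remember the annotator set correctly, the rule-based coreference is exposed as the separate dcoref annotator rather than as a coref.algorithm value, so a sketch would be:

    # Sketch: request the rule-based system via the 'dcoref' annotator
    with CoreNLPClient(annotators=['tokenize', 'ssplit', 'pos', 'lemma',
                                   'ner', 'parse', 'dcoref'],
                       timeout=30000, memory='16G') as client:
        ann = client.annotate(text)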
    

But I cannot give you any clue regarding questions 3 and 4.

Upvotes: 1
