Reputation: 347
Is there a faster way to implement the CoreNLPParser
or should I interact with the API through another library? Or should I dust off the Java books?
I have a corpus of 6500 sentences that I'm running through the CoreNLPParser class in nltk.parse.corenlp. I have isolated everything else I'm doing from my main project to test the tree_height function I wrote previously. However, the speed is unchanged: the process still takes more than 15 minutes to complete.
Here's my tree_height
function:
from nltk.parse.corenlp import CoreNLPParser

Parser = CoreNLPParser(url='http://localhost:9000')

def tree_height(tokenized_sent):
    ddep = Parser.raw_parse(tokenized_sent)
    for i in ddep:
        sent_height = i.height()
    return sent_height
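One likely bottleneck is that raw_parse issues a separate HTTP request for every sentence. nltk's CoreNLPParser also provides raw_parse_sents, which submits several sentences per request. A minimal sketch, assuming the same server is running; tree_heights is a helper name of my own, not from the original code:

```python
# Sketch: parse the corpus via raw_parse_sents instead of one
# raw_parse call per sentence. raw_parse_sents is a real method of
# nltk's CoreNLPParser; tree_heights is a hypothetical helper.
def tree_heights(parser, sentences):
    """Yield the height of the first parse tree for each sentence."""
    for parse_iter in parser.raw_parse_sents(sentences):
        tree = next(parse_iter)  # first (best) parse for this sentence
        yield tree.height()

# Usage, with the Parser object from the question:
# heights = list(tree_heights(Parser, corpus_sentences))
```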
I am parsing Spanish sentences and have previously started the CoreNLP server using the following command:
java -mx10g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP-spanish.properties -port 9000 -timeout 15000
I have also tried changing the mx3g part to mx5g, which doesn't seem to make much of a difference.
I've seen this discussion on GitHub and am running a recent version of StanfordCoreNLP.
--- Update ---
I was concerned that my script was slow because of inefficiencies or poorly written code, so here is what I've tried in order to find the inefficiencies:
I isolated the tree_height function and ran it on each sentence, and found no difference in speed (it takes as long as before I started isolating code).

Upvotes: 3
Views: 733
Reputation: 8739
OK, so here is a description of a Python interface we are developing. To get the latest version you'll have to download it from GitHub and follow the install instructions (which are easy to follow!):
Go to GitHub and clone the Python interface repo:
https://github.com/stanfordnlp/python-stanford-corenlp
cd into the directory and type python setup.py install
(soon we'll set this up with conda and pip etc., but for now it's still under development; you can get an older version on pip right now)
in a separate terminal window, start up a Java server:
java -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP-spanish.properties -port 9000 -timeout 15000
NOTE: make sure to have all of the necessary jars in your CLASSPATH
or run with the -cp "*"
option from a directory with all of the appropriate jars.
run this Python code:
import corenlp
client = corenlp.CoreNLPClient(start_server=False, annotators=["tokenize", "ssplit", "pos", "depparse"])
# other options for "output_format" include "json",
# "conllu", "xml", and "serialized"
ann = client.annotate(u"...", output_format="text")
ann will contain the final annotated info (including the dependency parse). This should be dramatically faster than what you are reporting; please try it out and let me know.
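To check how much faster the new client actually is on your corpus, a small stdlib timing harness can help. Here parse_fn is a placeholder for whichever single-sentence call you are measuring (e.g. a wrapper around client.annotate, or the original tree_height):

```python
import time

def seconds_per_sentence(parse_fn, sentences):
    """Time parse_fn over all sentences; return average seconds per sentence."""
    start = time.perf_counter()
    for sent in sentences:
        parse_fn(sent)
    return (time.perf_counter() - start) / len(sentences)
```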
Upvotes: 2
Reputation: 558
Since parsing natural language can be complex, the experience you describe seems unexceptional. When you parse a simple sentence with the spanishPCFG.ser.gz model in the Stanford Parser demo interface (http://nlp.stanford.edu:8080/parser/index.jsp), it may take some milliseconds; however, a long and complex sentence might take several seconds. You may give it a try; they supply statistics as well.
If you need to save time, you may try to parallelize your parsing task; that is all I can suggest.
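One way to follow that suggestion: the CoreNLP server handles concurrent requests, so a thread pool can overlap the HTTP round-trips. A minimal sketch, where parse_fn stands in for your existing single-sentence call (e.g. tree_height) and the function name is mine:

```python
from concurrent.futures import ThreadPoolExecutor

def parse_parallel(sentences, parse_fn, workers=4):
    """Apply parse_fn to each sentence across a thread pool, preserving order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_fn, sentences))

# e.g. heights = parse_parallel(corpus_sentences, tree_height)
```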
The link that you supplied is a discussion on taggers, which Steven Bird says is resolved, by the way.
Cheers
Upvotes: 0