matt_07734

Reputation: 347

Speeding up Stanford Dependency Parses in Python

Is there a faster way to use CoreNLPParser, or should I interact with the API through another library? Or should I dust off the Java books?

I have a corpus of 6500 sentences that I'm running through the CoreNLPParser method in nltk.parse.corenlp. To rule out the rest of my project, I isolated the tree_height function I wrote previously and tested it on its own. However, the speed is unchanged: the process still takes more than 15 minutes to complete.

Here's my tree_height function:

from nltk.parse.corenlp import CoreNLPParser

# client for the CoreNLP server started below
parser = CoreNLPParser(url='http://localhost:9000')

def tree_height(tokenized_sent):
    # raw_parse sends one HTTP request per sentence and yields parse trees
    ddep = parser.raw_parse(tokenized_sent)
    for i in ddep:
        sent_height = i.height()
    return sent_height

I am parsing Spanish sentences and have previously started the CoreNLP server using the following command:

java -mx10g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP-spanish.properties -port 9000 -timeout 15000
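
As a sanity check, I can confirm the server is reachable with a simple POST to its HTTP endpoint (a minimal sketch with one throwaway sentence):

import requests

# one-off request against the running server
props = '{"annotators": "tokenize", "outputFormat": "json"}'
resp = requests.post('http://localhost:9000/',
                     params={'properties': props},
                     data='Hola mundo.'.encode('utf-8'))
print(resp.status_code)  # 200 while the server is up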

I have also experimented with the memory flag (e.g., changing -mx3g to -mx5g), which doesn't seem to make much of a difference.

I've seen this discussion on GitHub and am running a recent version of StanfordCoreNLP.

--- Update ---

I was concerned that my script was slow because of inefficient or poorly written code, so here is what I've done to isolate the bottleneck:

  1. Iterating over all the data (from a pandas DataFrame) without calling any NLP functions takes about 20 seconds.
  2. Iterating over all the data and only sentence-tokenizing it takes about 30 seconds.
  3. In my latest attempt I collected all the tokenized sentences into a list and called tree_height on each one (roughly as in the timing sketch below), and found no difference in speed: it takes as long as before I started isolating code.
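
For reference, the timing loop in step 3 looked roughly like this (the sentence list here is a stand-in for my corpus; tree_height is the function above):

import time

# stand-in for the ~6500 tokenized sentences from my DataFrame
tokenized_sents = ["El gato duerme.", "Los niños juegan en el parque."]

start = time.time()
heights = [tree_height(sent) for sent in tokenized_sents]
print(f"{len(heights)} sentences in {time.time() - start:.1f}s")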

Upvotes: 3

Views: 733

Answers (2)

StanfordNLPHelp

Reputation: 8739

Ok, so here is a description of a Python interface we are developing. To get the latest version you'll have to download it from GitHub and follow the install instructions (which are easy to follow!)

Go to GitHub and clone the Python interface repo:

https://github.com/stanfordnlp/python-stanford-corenlp

cd into the directory and run python setup.py install

(soon we'll set this up with conda and pip, but for now it's still under development; you can get an older version from pip right now)

in a separate terminal window, start up a Java server:

java -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP-spanish.properties -port 9000 -timeout 15000

NOTE: make sure to have all of the necessary jars in your CLASSPATH or run with the -cp "*" option from a directory with all of the appropriate jars.

run this Python code:

import corenlp

client = corenlp.CoreNLPClient(start_server=False,
                               annotators=["tokenize", "ssplit", "pos", "depparse"])
# other choices for output_format include "json",
# "conllu", "xml", and "serialized"
ann = client.annotate(u"...", output_format="text")

ann will contain the final annotated info (including the dependency parse). This should be dramatically faster than what you are reporting; please try it out and let me know.
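
If it helps, here is a minimal sketch of driving the client over a corpus like yours (the sentence list is a stand-in, and it assumes the server from the command above is already running):

import corenlp

# reuse one client for the whole corpus so setup cost is paid once
client = corenlp.CoreNLPClient(start_server=False,
                               annotators=["tokenize", "ssplit", "pos", "depparse"])

sentences = ["El gato duerme.", "Los niños juegan."]  # stand-in corpus
# sending several sentences per request amortizes the per-call HTTP
# overhead; ssplit separates them again on the server side
ann = client.annotate(" ".join(sentences), output_format="text")
print(ann)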

Upvotes: 2

berkin

Reputation: 558

Since parsing natural language can be complex, your experience seems unexceptional. When you parse a simple sentence with the spanishPCFG.ser.gz model in the Stanford Parser demo interface (http://nlp.stanford.edu:8080/parser/index.jsp), it may take a few milliseconds; a long and complex sentence, however, might take several seconds. You may give it a try; they supply statistics as well.

If you need to save time, you could try to parallelize your parsing task; that is all I can suggest.
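
For example, a rough sketch with threads (assuming the server at localhost:9000 from the question is running; the worker count is just a starting point):

from concurrent.futures import ThreadPoolExecutor

from nltk.parse.corenlp import CoreNLPParser

parser = CoreNLPParser(url='http://localhost:9000')

def tree_height(sent):
    # height of the first parse tree the server returns
    return next(parser.raw_parse(sent)).height()

sentences = ["El gato duerme.", "Los niños juegan."]  # stand-in corpus

# threads help because each call mostly waits on the HTTP round-trip,
# and the server can parse several requests concurrently
with ThreadPoolExecutor(max_workers=8) as pool:
    heights = list(pool.map(tree_height, sentences))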

By the way, the link you supplied is a discussion about taggers, which Steven Bird says has been resolved.

Cheers

Upvotes: 0
