ThisClark

Reputation: 14823

Efficient batch processing with Stanford CoreNLP

Is it possible to speed up batch processing of documents with CoreNLP from the command line so that the models load only once? I would like to trim any unnecessarily repeated steps from the process.

I have 320,000 text files and I am trying to process them with CoreNLP. The desired result is 320,000 XML output files, one per input file.

To get from one text file to one XML file, I run the CoreNLP jar from the command line:

java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -props config.properties -file %%~f -outputDirectory MyOutput -outputExtension .xml -replaceExtension

This loads the models and does a variety of machine learning magic. The problem is that when I loop over every text file in a directory, the models are loaded again for each file, and by my estimate the whole run will take 44 days. A command prompt has been looping on my desktop for the last 7 days and it is nowhere near finished. This is the loop I run from a batch script:

for %%f in (Data\*.txt) do (
    java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -props config.properties -file %%~f -outputDirectory Output -outputExtension .xml -replaceExtension
)

I am using these annotators, specified in config.properties:
annotators = tokenize, ssplit, pos, lemma, ner, parse, dcoref, sentiment

Upvotes: 2

Views: 1342

Answers (1)

Aacini

Reputation: 67206

I know nothing about Stanford CoreNLP, so I googled it (you didn't include a link) and on its documentation page I found this description (under "Parsing a file and saving the output as XML"):

If you want to process a list of files use the following command line:

java -cp stanford-corenlp-VV.jar:stanford-corenlp-VV-models.jar:xom.jar:joda-time.jar:jollyday.jar:ejml-VV.jar -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP [ -props YOUR CONFIGURATION FILE ] -filelist A FILE CONTAINING YOUR LIST OF FILES

where the -filelist parameter points to a file whose content lists all files to be processed (one per line).

So I guess that you can process your files faster, with the models loaded only once, if you first store the names of all your text files in a list file:

dir /B *.txt > list.lst

... and then pass that list in the -filelist list.lst parameter in a single execution of Stanford CoreNLP.
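One detail to watch: dir /B prints bare file names without the directory, so the list only resolves correctly if it is generated in, and CoreNLP is started from, the folder that holds the text files. If that is inconvenient, the list can be written with absolute paths instead. Here is a minimal sketch in Python; the function name write_file_list is my own invention, and the Data folder and list.lst names are taken from the question. I have not run it against CoreNLP, so treat it as a starting point rather than a tested recipe:

```python
from pathlib import Path


def write_file_list(data_dir, list_path, pattern="*.txt"):
    """Write one absolute path per line for each file in data_dir matching pattern.

    Hypothetical helper: it only builds the text file that is then handed
    to CoreNLP through the -filelist parameter. Returns the number of
    paths written.
    """
    # Sort so the processing order is the same on every run.
    paths = sorted(
        p.resolve() for p in Path(data_dir).glob(pattern) if p.is_file()
    )
    with open(list_path, "w", encoding="utf-8") as out:
        for p in paths:
            out.write(f"{p}\n")
    return len(paths)
```

Calling write_file_list("Data", "list.lst") should produce a list that can be passed as -filelist list.lst, together with the -props, -outputDirectory, -outputExtension and -replaceExtension options from the question, in one single java invocation.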

Upvotes: 3
