Reputation: 14823
Is it possible to speed up batch processing of documents with CoreNLP from the command line so that the models load only once? I would like to trim any unnecessarily repeated steps from the process.
I have 320,000 text files that I am trying to process with CoreNLP. The desired result is 320,000 XML files, one per input file.
To get from one text file to one XML file, I run the CoreNLP jar from the command line:
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -props config.properties
-file %%~f -outputDirectory MyOutput -outputExtension .xml -replaceExtension
This loads the models and does a variety of machine learning magic. The problem is that when I loop over every text file in a directory, the models are reloaded for each file, and by my estimate the whole run will take 44 days. A command prompt has been looping on my desktop for the last 7 days and is nowhere near finished. This is the loop I run from a batch script:
for %%f in (Data\*.txt) do (
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -props config.properties ^
-file "%%~f" -outputDirectory Output -outputExtension .xml -replaceExtension
)
I am using these annotators, specified in config.properties:
annotators = tokenize, ssplit, pos, lemma, ner, parse, dcoref, sentiment
Upvotes: 2
Views: 1342
Reputation: 67206
I know nothing about Stanford CoreNLP, so I googled it (you didn't include a link), and on its documentation page I found this description (under "Parsing a file and saving the output as XML"):
If you want to process a list of files, use the following command line:
java -cp stanford-corenlp-VV.jar:stanford-corenlp-VV-models.jar:xom.jar:joda-time.jar:jollyday.jar:ejml-VV.jar -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP [ -props YOUR CONFIGURATION FILE ] -filelist A FILE CONTAINING YOUR LIST OF FILES
where the -filelist parameter points to a file whose content lists all files to be processed (one per line).
So I guess that you may process your files faster if you store a list of all your text files in a list file:
dir /B *.txt > list.lst
... and then pass that list via the -filelist list.lst parameter in a single execution of Stanford CoreNLP.
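If a script is easier to manage than the dir one-liner, the same idea can be sketched in Python. This is only a sketch under assumptions: the helper names write_filelist and build_command are made up for illustration, the Data and Output directory names come from the question, and I am assuming that -filelist can be combined with the output flags from the question (I have not verified that). It writes absolute paths so the list does not depend on the directory the command is run from.

```python
from pathlib import Path


def write_filelist(data_dir, list_path):
    """Write one absolute .txt path per line and return how many were listed."""
    # Hypothetical helper: collects every *.txt file directly inside data_dir.
    paths = sorted(p.resolve() for p in Path(data_dir).glob("*.txt"))
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return len(paths)


def build_command(list_path, output_dir="Output"):
    """Build a single CoreNLP invocation that covers the whole file list."""
    # Flags are copied from the question; only -file is swapped for -filelist.
    # Assumption: the output flags behave the same way with -filelist.
    return [
        "java", "-cp", "*", "-Xmx2g",
        "edu.stanford.nlp.pipeline.StanfordCoreNLP",
        "-props", "config.properties",
        "-filelist", list_path,
        "-outputDirectory", output_dir,
        "-outputExtension", ".xml",
        "-replaceExtension",
    ]
```

From the CoreNLP directory, you would call write_filelist("Data", "list.lst") and then pass build_command("list.lst") to subprocess.run, so that Java starts, and the models load, only once for the whole batch.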
Upvotes: 3