kaulmonish

Reputation: 408

Stanford Parser - MultiThreading issue - LexicalizedParser

Parsing runs smoothly on a small set of sentences, taking on the order of 200 ms to 1 s per sentence depending on the sentence length.

What do I want to achieve?

I want to parse 50 lakh (5 million) sentences in 1-2 hours.

To get there, I need to convert this loop:

            for (String sentence : sentences) {
                // Parse each sentence sequentially on the calling thread
                Tree parsed = AnalysisUtilities.getInstance().parseSentence(sentence).parse;
            }

into multithreaded calls. I wrote a multithreaded executor to do this, which looks like this:

            MultiThreadExecutor<String> mte = new MultiThreadExecutor<String>(2, new JobExecutor<String>() {
                @Override
                public void executeJob(String job) {
                    Tree parsed = AnalysisUtilities.getInstance().parseSentence(job).parse;
                    inputTrees.add(parsed);
                }
            }, "");


            for(String sentence: sentences){
                mte.addJob(sentence);
            }

It works fine with one thread, but as soon as I use multiple threads it fails with exceptions thrown inside the Stanford parse function. The exceptions look like this:

    java.lang.ArrayIndexOutOfBoundsException: 3
        at java.util.ArrayList.add(ArrayList.java:441)
        at edu.stanford.nlp.parser.lexparser.BaseLexicon.initRulesWithWord(BaseLexicon.java:300)
        at edu.stanford.nlp.parser.lexparser.BaseLexicon.isKnown(BaseLexicon.java:160)
        at edu.stanford.nlp.parser.lexparser.BaseLexicon.ruleIteratorByWord(BaseLexicon.java:212)
        at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.initializeChart(ExhaustivePCFGParser.java:1299)
        at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.parse(ExhaustivePCFGParser.java:388)
        at edu.stanford.nlp.parser.lexparser.LexicalizedParser.parse(LexicalizedParser.java:234)
        at edu.stanford.nlp.parser.lexparser.LexicalizedParser.parse(LexicalizedParser.java:189)
        at edu.cmu.ark.AnalysisUtilities.parseSentence(AnalysisUtilities.java:262)
        at edu.cmu.ark.QuestionAsker$1.executeJob(QuestionAsker.java:147)
        at edu.cmu.ark.QuestionAsker$1.executeJob(QuestionAsker.java:144)
        at edu.cmu.ark.MultiThreadExecutor$1.run(MultiThreadExecutor.java:37)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

    java.lang.RuntimeException: Dependencies not equal: "Spacious/CD" -> ".*./CC" left 0 and "Spacious/CD" -> "easy/RB" right 1
        at edu.stanford.nlp.parser.lexparser.MLEDependencyGrammar.probTB(MLEDependencyGrammar.java:586)
        at edu.stanford.nlp.parser.lexparser.MLEDependencyGrammar.scoreTB(MLEDependencyGrammar.java:511)
        at edu.stanford.nlp.parser.lexparser.AbstractDependencyGrammar.scoreTB(AbstractDependencyGrammar.java:229)
        at edu.stanford.nlp.parser.lexparser.ExhaustiveDependencyParser.parse(ExhaustiveDependencyParser.java:322)
        at edu.stanford.nlp.parser.lexparser.LexicalizedParser.parse(LexicalizedParser.java:244)
        at edu.stanford.nlp.parser.lexparser.LexicalizedParser.parse(LexicalizedParser.java:189)
        at edu.cmu.ark.AnalysisUtilities.parseSentence(AnalysisUtilities.java:262)
        at edu.cmu.ark.QuestionAsker$1.executeJob(QuestionAsker.java:147)
        at edu.cmu.ark.QuestionAsker$1.executeJob(QuestionAsker.java:144)
        at edu.cmu.ark.MultiThreadExecutor$1.run(MultiThreadExecutor.java:37)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

    java.lang.NullPointerException
        at edu.stanford.nlp.parser.lexparser.BiLexPCFGParser.projectHooks(BiLexPCFGParser.java:342)
        at edu.stanford.nlp.parser.lexparser.BiLexPCFGParser.processEdge(BiLexPCFGParser.java:546)
        at edu.stanford.nlp.parser.lexparser.BiLexPCFGParser.processItem(BiLexPCFGParser.java:571)
        at edu.stanford.nlp.parser.lexparser.BiLexPCFGParser.parse(BiLexPCFGParser.java:854)
        at edu.stanford.nlp.parser.lexparser.LexicalizedParser.parse(LexicalizedParser.java:255)
        at edu.stanford.nlp.parser.lexparser.LexicalizedParser.parse(LexicalizedParser.java:189)
        at edu.cmu.ark.AnalysisUtilities.parseSentence(AnalysisUtilities.java:262)
        at edu.cmu.ark.QuestionAsker$1.executeJob(QuestionAsker.java:147)
        at edu.cmu.ark.QuestionAsker$1.executeJob(QuestionAsker.java:144)
        at edu.cmu.ark.MultiThreadExecutor$1.run(MultiThreadExecutor.java:37)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Is there any way to do this? I found a previously asked question on the same problem, but it did not help.

Upvotes: 1

Views: 231

Answers (1)

StanfordNLPHelp

Reputation: 8739

Here is an example command that will run the parser in multi-threaded mode:

java -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse -parse.nthreads 4 -ssplit.eolonly -file some-sentences.txt -outputFormat text
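
If you would rather keep parsing from your own Java code than from the command line, one common pattern is to give each worker thread its own parser instance so that no mutable parser state is shared. The sketch below only illustrates that pattern under assumptions: `SentenceParser`, `ToyParser`, and `parseAll` are hypothetical names, not part of the Stanford API, and `ToyParser` merely stands in for a parser that is not thread-safe. Whether shared state is really the cause of the exceptions in the question is also an assumption, although an `ArrayIndexOutOfBoundsException` thrown from `ArrayList.add` is consistent with unsynchronized concurrent access.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

public class PerThreadParsing {

    /** Hypothetical stand-in for a parser wrapper; NOT a Stanford API. */
    interface SentenceParser {
        String parse(String sentence);
    }

    /** Toy parser with unsynchronized mutable state, i.e. unsafe to share across threads. */
    static class ToyParser implements SentenceParser {
        private final List<String> scratch = new ArrayList<>();

        @Override
        public String parse(String sentence) {
            scratch.clear();
            for (String token : sentence.trim().split("\\s+")) {
                scratch.add(token);
            }
            return "(S " + String.join(" ", scratch) + ")";
        }
    }

    /** Parses all sentences on nThreads workers; each worker thread lazily builds its own parser. */
    static List<String> parseAll(List<String> sentences, int nThreads,
                                 Supplier<SentenceParser> parserFactory) throws Exception {
        // One parser per thread, created on first use by that thread
        ThreadLocal<SentenceParser> localParser = ThreadLocal.withInitial(parserFactory);
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            List<Callable<String>> jobs = new ArrayList<>();
            for (String sentence : sentences) {
                jobs.add(() -> localParser.get().parse(sentence));
            }
            // invokeAll returns futures in the same order as the submitted jobs
            List<String> results = new ArrayList<>();
            for (Future<String> future : pool.invokeAll(jobs)) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> sentences = Arrays.asList("Spacious and easy to park", "The parser runs fine");
        for (String tree : parseAll(sentences, 2, ToyParser::new)) {
            System.out.println(tree);
        }
    }
}
```

Because the results are collected from the futures on the calling thread, nothing like `inputTrees` has to be written from several threads at once. For millions of sentences you would probably submit the work in batches rather than all at once, and keep in mind that one full parser model per thread costs extra memory.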

Upvotes: 1
