Tesseract_R

Reputation: 39

Parsing messy texts with Stanford Parser

I am running the Stanford Parser on a large chunk of text. The parser terminates when it hits a sentence it cannot parse, and gives the runtime error below. Is there a way to make the Stanford Parser ignore the error and move on to parsing the next sentence?

One way is to break the text into a myriad of one-sentence documents, parse each document, and record the output. However, this means loading the Stanford Parser once per document, and loading the parser takes far longer than the parsing itself. It would be great to find a way to avoid reloading the parser for every sentence.

Another solution might be to reload the parser once it hits an error, pick up the text where it stopped, and continue parsing from there. Does anyone know of a good way to implement this?

Last but not least, is there any Java wrapper that ignores errors and keeps a Java program running until it terminates naturally?

Thanks!

Exception in thread "main" java.lang.RuntimeException: CANNOT EVEN CREATE ARRAYS OF ORIGINAL SIZE!!
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.considerCreatingArrays(ExhaustivePCFGParser.java:2190)
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.parse(ExhaustivePCFGParser.java:347)
at edu.stanford.nlp.parser.lexparser.LexicalizedParserQuery.parseInternal(LexicalizedParserQuery.java:258)
at edu.stanford.nlp.parser.lexparser.LexicalizedParserQuery.parse(LexicalizedParserQuery.java:536)
at edu.stanford.nlp.parser.lexparser.LexicalizedParserQuery.parseAndReport(LexicalizedParserQuery.java:585)
at edu.stanford.nlp.parser.lexparser.ParseFiles.parseFiles(ParseFiles.java:213)
at edu.stanford.nlp.parser.lexparser.ParseFiles.parseFiles(ParseFiles.java:73)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.main(LexicalizedParser.java:1535)

Upvotes: 0

Views: 953

Answers (2)

Christopher Manning

Reputation: 9450

This error is basically an out-of-memory error. It most likely occurs because there are long stretches of text with no sentence-terminating punctuation (periods, question marks), so the parser is trying to parse a huge run of words that it treats as a single sentence.

The parser generally tries to continue after a parse failure, but it can't in this case: it failed to create the data structures needed to parse a longer sentence, and then also failed to recreate the data structures it was using previously. So, you need to do something.

Choices are:

  • Indicate sentence/short document boundaries yourself. This does not require loading the parser many times (and you should avoid that). From the command line, you can put each sentence in its own file, give the parser many documents to parse, and ask it to save the output in different files (see the -writeOutputFiles option).
  • Alternatively (and perhaps better), you can keep everything in one file by either putting one sentence per line or surrounding each sentence with simple XML/SGML-style tags, and then using the -sentences newline or -parseInside ELEMENT options.
  • Or you can simply avoid the problem by specifying a maximum sentence length; longer stretches that are not sentence-divided will be skipped. (This is great for runtime, too!) You can do this with -maxLength 80.
  • If you are writing your own program, you can catch this exception and try to resume. But it will only succeed if sufficient memory is available, unless you also take the steps in the earlier bullet points. A sketch of this approach follows this list.
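
For the last bullet, here is a minimal sketch, assuming the standard englishPCFG model and an input file named input.txt (both illustrative). The parser is loaded once, sentences are split with DocumentPreprocessor, the -maxLength option from the earlier bullet is applied at load time, and a RuntimeException on any single sentence is caught so the loop can continue.

    import java.util.List;

    import edu.stanford.nlp.ling.HasWord;
    import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
    import edu.stanford.nlp.process.DocumentPreprocessor;
    import edu.stanford.nlp.trees.Tree;

    public class RobustParseLoop {
      public static void main(String[] args) {
        // Load the parser once; -maxLength 80 skips over-long "sentences".
        LexicalizedParser parser = LexicalizedParser.loadModel(
            "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz",
            "-maxLength", "80");

        // DocumentPreprocessor splits the input file into sentences.
        for (List<HasWord> sentence : new DocumentPreprocessor("input.txt")) {
          try {
            Tree tree = parser.apply(sentence);
            System.out.println(tree);
          } catch (RuntimeException e) {
            // Recovery only works if enough memory is still available;
            // skip the offending sentence and keep going.
            System.err.println("Skipping sentence: " + e.getMessage());
          }
        }
      }
    }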

Upvotes: 4

Rahul

Reputation: 132

Parsers are known to be slow. You can try using a shallow parser, which will be relatively faster than the full-blown version. If you only need POS tags, consider using a tagger instead. Also, create a single static instance of the parser and reuse it rather than reloading it each time.
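
A minimal sketch of the static-instance suggestion, again assuming the standard englishPCFG model; the class name ParserHolder is hypothetical:

    import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

    // The parser is loaded once, when this class is first used,
    // and the same instance is reused for every parse call.
    public final class ParserHolder {
      private static final LexicalizedParser PARSER = LexicalizedParser.loadModel(
          "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");

      private ParserHolder() {}

      public static LexicalizedParser get() {
        return PARSER;
      }
    }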

Upvotes: 0
