Tesseract_R

Reputation: 39

Parsing messy texts with Stanford Parser

I am running the Stanford Parser on a large chunk of text. The parser terminates when it hits a sentence it cannot parse, and gives the runtime error below. Is there a way to make the Stanford Parser ignore the error and move on to parsing the next sentence?

One way is to break the text into a myriad of one-sentence documents, parse each document, and record the output. However, this means loading the Stanford Parser once per document, and loading the parser takes far longer than the parsing itself. It would be great to find a way to avoid reloading the parser for every sentence.

Another solution might be to reload the parser once it hits an error, pick up the text where it stopped, and continue parsing from there. Does anyone know of a good way to implement this?

Last but not least, is there any Java wrapper that ignores errors and keeps a Java program running until it terminates naturally?

Thanks!

Exception in thread "main" java.lang.RuntimeException: CANNOT EVEN CREATE ARRAYS OF ORIGINAL SIZE!!
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.considerCreatingArrays(ExhaustivePCFGParser.java:2190)
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.parse(ExhaustivePCFGParser.java:347)
at edu.stanford.nlp.parser.lexparser.LexicalizedParserQuery.parseInternal(LexicalizedParserQuery.java:258)
at edu.stanford.nlp.parser.lexparser.LexicalizedParserQuery.parse(LexicalizedParserQuery.java:536)
at edu.stanford.nlp.parser.lexparser.LexicalizedParserQuery.parseAndReport(LexicalizedParserQuery.java:585)
at edu.stanford.nlp.parser.lexparser.ParseFiles.parseFiles(ParseFiles.java:213)
at edu.stanford.nlp.parser.lexparser.ParseFiles.parseFiles(ParseFiles.java:73)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.main(LexicalizedParser.java:1535)

Upvotes: 0

Views: 953

Answers (2)

Christopher Manning

Reputation: 9450

This error is basically an out-of-memory error. It most likely occurs because there are long stretches of text with no sentence-terminating punctuation (periods, question marks), so the parser is trying to parse a huge run of words that it treats as a single sentence.

The parser generally tries to continue after a parse failure, but it can't in this case: it failed to create the data structures needed to parse a longer sentence, and then also failed to recreate the data structures it was using previously. So, you need to do something.

Choices are:

  • Indicate sentence/short document boundaries yourself. This does not require loading the parser many times (and you should avoid that). From the command line, you can put each sentence in its own file, give the parser many documents to parse, and ask it to save the output in different files (see the -writeOutputFiles option).
  • Alternatively (and perhaps better), you can keep everything in one file by either putting one sentence per line or surrounding each sentence with simple XML/SGML-style tags, and then using the -sentences newline or -parseInside ELEMENT options.
  • Or you can simply avoid the problem by specifying a maximum sentence length; longer stretches that are not sentence-divided will be skipped. (This is great for runtime, too!) You can do this with -maxLength 80.
  • If you are writing your own program, you can catch this exception and try to resume. But it will only succeed if sufficient memory is available, unless you also take the steps in the earlier bullet points. A sketch of this approach follows this list.
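
For the last bullet, here is a minimal sketch, assuming the standard englishPCFG model and an input file named input.txt (both illustrative). The parser is loaded once, sentences are split with DocumentPreprocessor, the -maxLength option from the earlier bullet is applied at load time, and a RuntimeException on any single sentence is caught so the loop can continue.

    import java.util.List;

    import edu.stanford.nlp.ling.HasWord;
    import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
    import edu.stanford.nlp.process.DocumentPreprocessor;
    import edu.stanford.nlp.trees.Tree;

    public class RobustParseLoop {
      public static void main(String[] args) {
        // Load the parser once; -maxLength 80 skips over-long "sentences".
        LexicalizedParser parser = LexicalizedParser.loadModel(
            "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz",
            "-maxLength", "80");

        // DocumentPreprocessor splits the input file into sentences.
        for (List<HasWord> sentence : new DocumentPreprocessor("input.txt")) {
          try {
            Tree tree = parser.apply(sentence);
            System.out.println(tree);
          } catch (RuntimeException e) {
            // Recovery only works if enough memory is still available;
            // skip the offending sentence and keep going.
            System.err.println("Skipping sentence: " + e.getMessage());
          }
        }
      }
    }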

Upvotes: 4

Rahul

Reputation: 132

Parsers are known to be slow. You can try using a shallow parser, which will be relatively faster than the full-blown version. If you only need POS tags, consider using a tagger instead. Also, create a single static instance of the parser and reuse it rather than reloading it each time.
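
A minimal sketch of the static-instance suggestion, again assuming the standard englishPCFG model; the class name ParserHolder is hypothetical:

    import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

    // The parser is loaded once, when this class is first used,
    // and the same instance is reused for every parse call.
    public final class ParserHolder {
      private static final LexicalizedParser PARSER = LexicalizedParser.loadModel(
          "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");

      private ParserHolder() {}

      public static LexicalizedParser get() {
        return PARSER;
      }
    }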

Upvotes: 0
