Colossal memory usage/stack problems with ANTLR lexer/parser

Question

I'm porting a grammar from flex/bison and mostly have everything up and running: my token stream seems fine, and my parser grammar compiles and runs. However, I'm running into runaway stack/memory usage even with small to moderately sized inputs. What is the preferred construct for chaining together an unbounded sequence of the same nonterminal? In my Bison grammar I had production rules of the form:

statements: statement | statement statements
words: | word words

In ANTLR, if I maintain the same rule setup, this seems to perform admirably on small inputs (on the order of 4kB), but leads to stack overflow on larger inputs (on the order of 100kB). In both cases the automated parse tree produced is also rather ungainly.
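To illustrate why the right-recursive form runs out of stack as the input grows, here is a minimal sketch of a hand-written recursive-descent matcher. It is not the code ANTLR generates; the function names are hypothetical, and Python's recursion limit merely stands in for the JVM stack. The point is only that `words: | word words` costs one nested call per word, while a loop matches the same input in a single frame.

```python
import sys

def parse_words_recursive(tokens, i=0, depth=1):
    """words: | word words
    One nested call per word; returns the deepest call level reached."""
    if i == len(tokens):              # empty alternative: end of the sequence
        return depth
    return parse_words_recursive(tokens, i + 1, depth + 1)

def parse_words_loop(tokens):
    """words: word*
    A loop inside a single frame; returns the number of words matched."""
    i = 0
    while i < len(tokens):
        i += 1
    return i

# Small input: the recursion is deep but still fits.
small = ["w"] * 100
depth_small = parse_words_recursive(small)

# Larger input: twice the interpreter's recursion limit, so it cannot fit.
big = ["w"] * (sys.getrecursionlimit() * 2)
try:
    parse_words_recursive(big)
    overflowed = False
except RecursionError:
    overflowed = True

# The loop handles the same input with constant stack depth.
loop_count = parse_words_loop(big)
```

The call depth of the recursive version grows linearly with the number of tokens, which is consistent with small files parsing fine and larger ones overflowing.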

I experimented with changing these production rules to have an explicitly additive (rather than recursive) form:

statements: statement+
words: word*

However, this seems to have led to an absolutely horrific blowup in memory usage (upwards of 1 GB) on even very small inputs, and the parser has not yet managed to return a parse tree after 20 minutes of letting it run.

Any pointers would be appreciated.

elfprince13 · Accepted Answer

Okay, so I've gotten it working, in the following manner. My YACC grammar had the following constructions:

lines: line | line lines;
words: | word words;

However, this did not make the recursive parsing happy, so I rewrote it as:

lines: line+;
words: word*;

This is in line with @280Z28's feedback (and my original guess). It hung the parser, which is why I posted the question in the first place, but the debugging procedure outlined in my comments on @280Z28's answer showed that only the lines rule was causing the problem (words was fine). On a whim, I tried the following rewrite:

lines   : stmt (EOL stmt)+ EOL*;

(where line had originally been defined as:

line : stmt (EOL | EOF);

)

This seems to be working quite well, even for large inputs. However it is entirely unclear to me WHY this is the Right Thing To Do(tm), or why it makes a difference compared to the revision which prompted this question. Any feedback on this matter would still be appreciated.
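One possible explanation, offered as a guess rather than a confirmed diagnosis: if stmt can match zero tokens, and if matching EOF does not advance the token stream, then `line+` with `line : stmt (EOL | EOF)` can keep matching an empty line at end of input forever, adding a parse tree node each time. Both conditions are assumptions; the post does not show the definition of stmt or name the ANTLR version. The toy model below (all names hypothetical, not ANTLR's API) shows the difference between the two rule shapes under those assumptions.

```python
EOL, EOF, WORD = "EOL", "EOF", "WORD"

class Tokens:
    """Toy token stream that keeps reporting EOF once the input is exhausted."""
    def __init__(self, toks):
        self.toks, self.pos = toks, 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else EOF

    def consume(self):
        if self.peek() != EOF:        # assumption: EOF is never consumed
            self.pos += 1

def stmt(ts):
    """stmt: WORD*  (hypothetical definition; may match nothing)"""
    while ts.peek() == WORD:
        ts.consume()

def lines_plus(ts, cap=1000):
    """lines: line+ ;  line: stmt (EOL | EOF)
    Returns the number of lines matched, giving up at cap."""
    n = 0
    while n < cap:
        stmt(ts)
        if ts.peek() not in (EOL, EOF):
            break
        ts.consume()                  # does nothing at EOF, so the loop never ends
        n += 1
    return n

def lines_rewritten(ts):
    """lines: stmt (EOL stmt)+ EOL*
    Every repetition must consume an EOL, so the loop stops at end of input.
    (The '+' minimum of one repetition is not enforced in this sketch.)"""
    stmt(ts)
    n = 1
    while ts.peek() == EOL:
        ts.consume()
        stmt(ts)
        n += 1
    return n

source = [WORD, EOL, WORD, EOL]
runaway = lines_plus(Tokens(source))        # only stops because of the cap
bounded = lines_rewritten(Tokens(source))   # stops on its own
```

In this model the rewritten rule terminates because each pass through the loop consumes a real EOL token, whereas the original loop can spin at EOF without consuming anything. Whether this matches the actual grammar depends on how stmt is defined.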
