Reputation: 86
I'm porting over a grammar from flex/bison, and mostly seem to have everything up and running (in particular, my token stream seems fine, and my parser grammar is compiling and running), but seem to be running into problems of runaway stack/memory usage even with very small/moderate sized inputs to my grammar. What is the preferred construct for chaining together an unbounded sequence of the same nonterminal? In my Bison grammar I had production rules of the form:
statements: statement | statement statements
words: | word words
In ANTLR, if I maintain the same rule setup, this seems to perform admirably on small inputs (on the order of 4kB), but leads to stack overflow on larger inputs (on the order of 100kB). In both cases the automated parse tree produced is also rather ungainly.
I experimented with changing these production rules to have an explicitly additive (rather than recursive form):
statements: statement+
words: word*
However this seems to have lead to absolutely horrific blowup in memory usage (upwards of 1GB) on even very small inputs, and the parser has not yet managed to return a parse tree after 20 minutes of letting it run.
Any pointers would be appreciated.
Upvotes: 1
Views: 1195
Reputation: 86
Okay, so I've gotten it working, in the following manner. My YACC grammar had the following constructions:
lines: lines | line lines;
words: | word words;
However, this did not make the recursive parsing happy, so I rewrote it as:
lines: line+;
words: word*;
Which is in line with @280Z28's feedback (and my original guess). This hung the parser, which is why I posted the question in the first place, but the debugging procedure outlined in my comments to @280Z28's answer showed that in fact it was only the lines
parsing which was causing the problem (words
) was fine. On a whim, I tried the following rewrite:
lines : stmt (EOL stmt)+ EOL*;
(where line
had originally been defined as:
line : stmt (EOL | EOF);
)
This seems to be working quite well, even for large inputs. However it is entirely unclear to me WHY this is the Right Thing To Do(tm), or why it makes a difference compared to the revision which prompted this question. Any feedback on this matter would still be appreciated.
Upvotes: 1
Reputation: 99869
Your rewritten statements are the optimal ANTLR 4 form of the two rules you described (highest performing and minimum memory usage). Here is some general feedback regarding the issues you describe.
I developed some very advanced diagnostic code for numerous potential performance problems. Much of this code is included in TestPerformance
, but it is geared towards expert users and requires a rather deep understanding of ANTLR 4's new ALL(*) algorithm to interpret the results.
Terence and I are interested in turning the above into a tool that users can make use of. I may be able to help (run and interpret the test) if you provide a complete grammar and example inputs, so that I can use that grammar and input pair as part of evaluating the usability of a tool further down the road that automates the analysis.
Make sure you are using the two-stage parsing strategy from the book. In many cases, this will vastly improve the parsing performance for correct inputs (incorrect inputs would not be faster).
We don't like to use more memory than necessary, but you should be aware that we are working under a very different definition of "excessive" - e.g. we run our testing applications with -Xmx4g
to -Xmx12g
, depending on the test.
Upvotes: 2