alagris

Reputation: 2216

Antlr parser StackOverflowException (for parsing regular expressions)

I made a simple grammar for parsing regular expressions. Unfortunately, when I try to test my regex compiler on large expressions, I get a StackOverflowException. The problem is similar to this one, except that their solution no longer works in my scenario. Here is my grammar:

union: concat | concat '|' union ;
concat: kleene_closure concat | kleene_closure;
kleene_closure: atomic '*' | atomic ;
atomic : '(' union ')' | LETTER ;

Now the problem is that I have a really large file that looks like

something1 | something2 | something3 | .... | something1000

I use ANTLR's Visitor class for parsing. I know I could probably optimize it somewhat by using +/* like this:

union: (concat '|')* concat ;
concat: kleene_closure+;
kleene_closure: atomic '*' | atomic ;
atomic : '(' union ')' | LETTER ;
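
With the flattened union rule, the generated UnionContext exposes all the concat children as a list, so the visitor can loop instead of recursing. A rough sketch (RegexParser, Node and UnionNode are placeholder names, not my actual code):

// Rough sketch; "Regex" is a placeholder grammar name.
class AstBuilder extends RegexBaseVisitor<Node> {
    @Override
    public Node visitUnion(RegexParser.UnionContext ctx) {
        // ctx.concat() returns every concat child matched by (concat '|')* plus
        // the trailing one, so a plain loop replaces the old right recursion.
        Node result = null;
        for (RegexParser.ConcatContext c : ctx.concat()) {
            Node next = visit(c);
            result = (result == null) ? next : new UnionNode(result, next);
        }
        return result;
    }
}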

However, flattening the rules doesn't really solve the problem, due to the recursive nature of this grammar. For instance, it would now fail on the following sample, which clearly requires recursion:

(...(((something1) | something2) | something3) | .... ) | something1000

How can I avoid StackOverflowException? How do other compilers, like for instance a C compiler, deal with really large texts that have thousands of lines of code?

Upvotes: 0

Views: 238

Answers (2)

Nikolay Handzhiyski

Reputation: 1579

You can define your grammar in ABNF syntax and give it to TGS* to parse it iteratively, without using the thread's dedicated stack for recursion. The parser generator generates parsers that run iteratively for all of their operations: lexing, parsing, tree construction, tree-to-string conversion, tree iteration and tree destruction.

At runtime the parser can also report the tree-building information through events only, so that you can build your tree however you want (or do any calculation without a tree at all). In this case, when you parse with events (a deterministic parser grammar without explicit tree building), if you have enough memory to hold the depth of the parsing rules, you can practically "stream" any input regardless of its size.
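
Purely as an illustration of the event-only style (the interface below is made up for this sketch; it is not the actual TGS API), a client can compute something over the input without ever materializing a tree:

// Hypothetical callback interface, for illustration only -- not the TGS API.
interface ParseEvents {
    void enterRule(String ruleName);   // e.g. "alternative", "element"
    void terminal(String text);        // matched terminal text, e.g. an identifier
    void exitRule(String ruleName);
}

// Example client: counts identifiers without building any tree, so memory use
// depends only on the nesting depth tracked by the parser itself.
class IdentifierCounter implements ParseEvents {
    int identifiers = 0;
    public void enterRule(String ruleName) {
        if (ruleName.equals("identifier")) identifiers++;
    }
    public void terminal(String text) { }
    public void exitRule(String ruleName) { }
}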

The deterministic grammar in ABNF (RFC 5234) like syntax is this:

alternative     = concatenation *('|' concatenation)
concatenation   = 1* kleene-closure
kleene-closure  = atomic 0*1 '*'
atomic          = '(' alternative ')' / letter
letter          = 'a'-'z' / 'A'-'Z'

This grammar, however, allows only one letter per item, so for an input such as "ab" you will get two atomic nodes, one letter per node. If you want identifiers with more than one letter, then this grammar may do:

alternative     = concatenation *('|' *ws concatenation)
concatenation   = element *(ws 0*1 element)
element         = primary 0*1 '*'
primary         = '(' *ws alternative ')' / identifier
identifier      = 1*('a'-'z' / 'A'-'Z')
ws              = %x20 / %x9 / %xA / %xD

You can read that as: an alternative is made of one or more concatenations separated by |. A concatenation is one or more elements separated by at least one white-space character. An element is either a parenthesized alternative or an identifier, optionally followed by *; an identifier in turn is one or more letters. White space is a space, tab, new line or carriage return. If you want more complex identifiers you may use this:

identifier      = (letter / '_') *(letter / '_' / digit)
letter          = 'a'-'z' / 'A'-'Z'
digit           = '0'-'9'

*I work on that project.

Upvotes: 1

rici

Reputation: 241901

If you're going to use a recursive descent parser, then you will inevitably run into inputs whose nesting exceeds the call stack depth. This problem is ameliorated by languages like Java, which are capable of controlling their own stack depth, so that there is a controllable result like a StackOverflowException. But it's still a real problem.
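
For what it's worth, on the JVM you can at least buy more headroom by running the recursive parse on a thread with an explicitly sized stack. A minimal sketch (the 64 MB figure is arbitrary, and this only postpones the failure, it doesn't remove it):

public class DeepParse {
    public static void main(String[] args) throws InterruptedException {
        Runnable parseJob = () -> {
            // ... build the lexer/parser and call the start rule here ...
        };
        // The last constructor argument is the requested stack size in bytes.
        Thread worker = new Thread(null, parseJob, "parser", 64L * 1024 * 1024);
        worker.start();
        worker.join();
    }
}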

Parser generators like Yacc/Bison and Java Cup use a bottom-up LALR(1) algorithm which uses an explicit stack for temporary storage, rather than using the call stack for that purpose. That means that the parsers have to manage storage for the parser stack (or use a container ADT from the host language's standard library, if there is one), which is slightly more complex. But you don't have to deal with that complexity; it's built in to the parser generator.
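
The same transformation can be done by hand for any recursive walk: keep your own stack on the heap instead of using the call stack. A minimal sketch in Java (the Node type is a stand-in, not tied to any particular parser):

import java.util.ArrayDeque;
import java.util.Deque;

class Node {
    Node left, right;           // either child may be null
}

class TreeWalk {
    // Counts the nodes of an arbitrarily deep tree. The ArrayDeque lives on the
    // heap, so the depth is limited by available memory, not by the call stack.
    static long countNodes(Node root) {
        Deque<Node> stack = new ArrayDeque<>();
        if (root != null) stack.push(root);
        long count = 0;
        while (!stack.isEmpty()) {
            Node n = stack.pop();
            count++;
            if (n.left != null) stack.push(n.left);
            if (n.right != null) stack.push(n.right);
        }
        return count;
    }
}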

There are several advantages of the explicit stack for the parser generator:

  • It's easier to control maximum stack size;
  • The maximum stack size is (usually) only limited by available memory;
  • It's probably more memory efficient because control flow information doesn't need to be kept in stack frames.

Still, it's not a panacea. A sufficiently complicated expression will exceed any fixed stack size, and that can lead to certain programs being unparseable. Furthermore, if you take advantage of the flexibility mentioned in the second point above ("only limited by available memory"), you may well find that your compiler is terminated unceremoniously by an OOM process (or a segfault) rather than being able to respond to a more polite out-of-memory exception (depending on OS and configuration, of course).

As to:

How do other compilers, like for instance a C compiler, deal with really large texts that have thousands of lines of code?

Having thousands of lines of code is not a problem if you use a repetition operator in your grammar (or, in the case that you are using an LALR(1) parser, that your grammar is left-recursive). The problem arises, as you note in your question, when you have texts with thousands of nested blocks. And the answer is that many C compilers don't deal gracefully with such texts. Here's a simple experiment with gcc:

$ # A function which generates deeply-nested C programs
$ type deep
deep is a function
deep () { 
    n=$1;
    printf "%s\n%s\n %s\n" '#include <stdio.h>' 'int main(void) {' 'int a0 = 0;';
    for ((i=0; i<n; ++i))
    do
        printf '%*s{ int a%d = a%d + 1;\n' $((i+1)) '' $((i+1)) $i;
    done;
    printf '%*sprintf("%%d\\n", a%d);\n' $n '' $n;
    for ((i=0; i<n; ++i))
    do
        printf "%s" '}';
    done;
    printf "%s\n" '}'
}
$ deep 3
#include <stdio.h>
int main(void) {
 int a0 = 0;
 { int a1 = a0 + 1;
  { int a2 = a1 + 1;
   { int a3 = a2 + 1;
   printf("%d\n", a3);
}}}}
$ # For small depths, GCC is OK with that.
$ deep 3 | gcc -x c - && ./a.out
3
$ # Let's go deeper:
$ deep 10 | gcc -x c - && ./a.out
10
$ deep 100 | gcc -x c - && ./a.out
100
$ deep 1000 | gcc -x c - && ./a.out
1000
$ deep 10000 | gcc -x c - && ./a.out
10000
$ # Ka-bang. (Took quite a long time, too.)
$ deep 100000 | gcc -x c - && ./a.out
gcc: internal compiler error: Segmentation fault (program cc1)
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-7/README.Bugs> for instructions.

Without the nested blocks, gcc is still slow but can handle the program:

$ type big
big is a function
big () 
{ 
    n=$1;
    printf "%s\n%s\n %s\n" '#include <stdio.h>' 'int main(void) {' 'int a0 = 0;';
    for ((i=0; i<n; ++i))
    do
        printf ' int a%d = a%d + 1;\n' $((i+1)) $i;
    done;
    printf ' printf("%%d\\n", a%d);\n' $n;
    printf "%s\n" '}'
}
$ big 3
#include <stdio.h>
int main(void) {
 int a0 = 0;
 int a1 = a0 + 1;
 int a2 = a1 + 1;
 int a3 = a2 + 1;
 printf("%d\n", a3);
}
$ big 3|gcc -x c - && ./a.out
3
$ big 10000|gcc -x c - && ./a.out
10000
$ big 100000|gcc -x c - && ./a.out
100000

Upvotes: 2
