Owen Allen

Reputation: 11968

Matching but ignoring nested parentheses with JISON

I'm working on a grammar for a templating system. I've hit a snag in the build and can't figure out how to solve it. I've simplified the test case to show exactly what I'm doing.

Example Strings:

The rule is that anything goes inside a parenthetical: any characters at all. I don't need to validate the contents or ensure they match a particular format. However, as I understand it, for the parser to work I do need to keep track of opening and closing ( and ), otherwise the lexer can't know where one parenthetical statement ends and another begins, as in (foo()) (bar). To keep track of that, I'm using a paren start condition with a counter that is incremented whenever an open paren is hit inside a parenthetical and decremented whenever a close paren is hit.
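To illustrate the counting idea outside of Jison, here is a minimal plain-JavaScript sketch. The function name splitTopLevelGroups is hypothetical and not part of Jison or of my grammar; it only shows how a depth counter tells where one top-level parenthetical ends and the next begins.

```javascript
// Hypothetical helper (not part of Jison): split the input into top-level
// parenthetical groups by tracking nesting depth with a counter.
function splitTopLevelGroups(input) {
  const groups = [];
  let depth = 0;   // how many parentheses are currently open
  let start = -1;  // index of the outermost "(" of the current group
  for (let i = 0; i < input.length; i++) {
    const ch = input[i];
    if (ch === "(") {
      if (depth === 0) start = i; // entering a new top-level group
      depth++;
    } else if (ch === ")" && depth > 0) {
      depth--;
      // Depth back at zero means the top-level group just closed.
      if (depth === 0) groups.push(input.slice(start, i + 1));
    }
  }
  return groups;
}

// Yields the two groups "(foo())" and "(bar)".
console.log(splitTopLevelGroups("(foo()) (bar)"));
```

Without the counter, the first ) in (foo()) would be taken as the end of the group.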

The problem is that it's just not working. The main symptom is that it never appears to hit my <paren>")" rule, and yet it hits the <paren>"(" rule just fine. The two rules look syntactically the same, so why does one work and not the other?

Grammar

%lex

%x paren

%%

\s+                   /* skip whitespace */
<INITIAL>"("         { this.begin("paren"); parenCount = 1; return "parenStart"; };
<paren>"("            { console.log("parenStart", parenCount); parenCount++; return "parenInterior"; };
<paren>")"            { console.log("parenEnd", parenCount); parenCount--; if (parenCount === 0) { this.popState(); return "parenEnd"; } else { return "parenInterior"; } };
<paren>[^\)\(]+       { console.log(this); return "parenInterior"; };
<<EOF>>               return 'EOF';
.                     return 'INVALID';

/lex

%start expressions

%% /* language grammar */

expressions
    : parenStart parenInterior parenEnd { return $1 + $2 + $3; }
    ;

%%

parenCount = 0;

Upvotes: 0

Views: 282

Answers (1)

Louis

Reputation: 151441

I believe your problem is that your grammar accepts exactly one parenInterior token between parenStart and parenEnd, rather than a sequence of tokens. If I change your grammar to this, then I get something that can handle the strings you've shown in your question:

%lex

%x paren

%%

\s+                   /* skip whitespace */
<INITIAL>"("         { this.begin("paren"); parenCount = 1; return "parenStart"; };
<paren>"("            { console.log("parenStart", parenCount); parenCount++; return "parenInterior"; };
<paren>")"            { console.log("parenEnd", parenCount); parenCount--; if (parenCount === 0) { this.popState(); return "parenEnd"; } else { return "parenInterior"; } };
<paren>[^\)\(]+       { console.log(this); return "parenInterior"; };
<<EOF>>               return 'EOF';
.                     return 'WHATEVER';

/lex

%start expressions

%% /* language grammar */

expressions
    : whateverSeq parenStart parenInteriorSeq parenEnd whateverSeq EOF { return $1 + $2 + $3 + $4 + $5; }
    ;

parenInteriorSeq
    : parenInterior 
    | parenInteriorSeq parenInterior -> $1.concat($2)
    ;

whateverSeq
    : -> ""      // Empty sequence.
    | whatevers  // One or more WHATEVER tokens.
    ;

whatevers
    : WHATEVER
    | whatevers WHATEVER -> $1.concat($2)
    ;

%%

parenCount = 0;

Then there's no problem with nesting parentheses.

Salient changes:

  1. Replaced INVALID with WHATEVER and added rules that accept a sequence of WHATEVER tokens at the start and at the end. This allows input like blah (foo) blah.

  2. Replaced parenInterior with parenInteriorSeq so that you can have a sequence of parenInterior tokens inside the parentheses. This is necessary because in a string like (foo()), foo is one token, the inner ( is another token, and the inner ) is yet another. So the grammar has to accept a list of tokens.
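To make the second point concrete, here is a rough plain-JavaScript imitation of the lexer rules above. It is only a sketch and does not use Jison itself: the tokenize function is a hypothetical stand-in, and it assumes the rules behave as written in the grammar (whitespace skipped only outside parentheses, any other stray character becoming WHATEVER).

```javascript
// Plain-JavaScript imitation of the lexer rules above (not Jison itself).
// Returns an array of [tokenName, text] pairs.
function tokenize(input) {
  const tokens = [];
  let parenCount = 0; // 0 means we are in the INITIAL state
  let i = 0;
  while (i < input.length) {
    const ch = input[i];
    if (parenCount === 0) {
      if (/\s/.test(ch)) { i++; continue; } // \s+ : skip whitespace
      if (ch === "(") {
        parenCount = 1;                     // <INITIAL>"("
        tokens.push(["parenStart", ch]);
      } else {
        tokens.push(["WHATEVER", ch]);      // the catch-all "." rule
      }
      i++;
    } else if (ch === "(") {
      parenCount++;                         // <paren>"("
      tokens.push(["parenInterior", ch]);
      i++;
    } else if (ch === ")") {
      parenCount--;                         // <paren>")"
      tokens.push([parenCount === 0 ? "parenEnd" : "parenInterior", ch]);
      i++;
    } else {
      // <paren>[^\)\(]+ : longest run of non-parenthesis characters
      let j = i;
      while (j < input.length && input[j] !== "(" && input[j] !== ")") j++;
      tokens.push(["parenInterior", input.slice(i, j)]);
      i = j;
    }
  }
  tokens.push(["EOF", ""]);
  return tokens;
}

// Prints: parenStart parenInterior parenInterior parenInterior parenEnd EOF
console.log(tokenize("(foo())").map(t => t[0]).join(" "));
```

The three consecutive parenInterior tokens are what a rule with a single parenInterior cannot match, which is why the grammar needs parenInteriorSeq.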

Upvotes: 1
