Mr Meeseeks
Mr Meeseeks

Reputation: 1

Bison error recovery and balanced parentheses

Suppose we have a file where each line is composed of strings of balanced parentheses. We can recognise this by putting the following classic grammar into Bison:

/* ignoring preamble and lexer definitions */
%% 
lines: line | lines line;
line: balanced NEWLINE;

balanced: %empty
| balanced '(' balanced ')'
| balanced '[' balanced ']'
| balanced '{' balanced '}'
| balanced '<' balanced '>';

Next improvement is to handle imbalanced brackets (or braces, or ...). In this case we want the parser to blindly consume everything up to the NEWLINE and start afresh on the next line. We can do this using bison's error recovery, although it is ungainly. But the below will shift the error token when it encounters an imbalanced bracket; it gains access to the any nonterminal and this allows it to slurp the rest of the line.

%% 
lines: line | lines line;
line: balanced NEWLINE;

balanced: %empty
| balanced '(' balanced ')'
| balanced '[' balanced ']'
| balanced '{' balanced '}'
| balanced '<' balanced '>';

/* below are new */

any: %empty 
| any '(' | any ')' | any '[' | any ']'
| any '{' | any '}' | any '<' | any '>';

line:
balanced error ')' any NEWLINE
| balanced error ']' any NEWLINE { /* do something special when an out of order ']' is detected */ }
| balanced error '}' any NEWLINE
| balanced error '>' any NEWLINE;

My question is how to handle the next improvement: autocomplete truncated lines. For example, if our line is

<>(([])<

I want to be able to determine what the closing string is, in this case >) or equivalently (< as I can reverse orders and swap closing for opening tokens. I know how to detect and skip over truncated lines:

line: balanced error NEWLINE;

but this results in everything from the first open paren to be discarded. Is there any way of accessing bison's internal token stack before it pops everything? Or some grammatical construct that will let me unwind it one at a time?

I'm currently doing something disgusting: pushing the opening symbols [({< onto a stack in a context object from the lexer, and when balanced is reduced popping the stack, so if we enter error recovery I can access the open brackets via the context. But, well, this is rotten.

I've also tried looking at the internals of the bison parser but it seems the internal stack is accessible from semantic error actions only after it has popped everything off it.

Upvotes: 0

Views: 145

Answers (1)

Chris Dodd
Chris Dodd

Reputation: 126120

Your first set of error recovery rules are unnecessarily complex -- error recovery will already discard tokens as needed to recover, so you never need rules like any.

Going back to your base code (no error recovery), you only need to add one rule to get the recovery action of "on error, discard everything up to a newline":

line: error NEWLINE

With just this rule, when an error occurs, it will discard the result of the current line and go on to the next line.


To understand why this works (and understand more complex cases), you need to understand the shift-reduce parser automata and how error recovery works. Basically, you have a state machine with a stack of states -- the state on top of the stack is the current state. At each step it looks at the current state and the lookahead token to decide what to do -- shift (which consumes the lookahead and pushes another state on the stack), reduce (which pops 0 or more states from the stack, runs a user action, and then does a goto action based on the rule executed), or error (generate a syntax error message and punt).

If you have error recovery rules, then whenever an error occurs, it will:

  1. Pop states from the stack until it finds one that can shift an error token
  2. Shift the error token (which pushes another state on the stack)
  3. Discard tokens from the input (maybe none) until it can do an action.
  4. Do the action (shift or reduce followed by goto). If it was a reduce+goto, go back to step 3 (continue discarding tokens). Only once it does a shift of a non-error token does it go back to 'normal' parsing (where another error will result in another error token.)

In step one, if it never finds a state that can shift an error token, it just exits the parser with a failure (and this is what will always happen if there are no error recovery rules).


So with just the line: error NEWLINE rule, when you have misbalanced input (like your example), when it gets to the NEWLINE, the stack will look something like

<top>
state for 'balanced: balanced '<' balanced ...
state for 'balanced: balanced '<' ...
state for 'balanced: balanced '(' balanced ...
state for 'balanced: balanced '(' ...
state for '(line|balanced): balanced ...
state for the beginning of a line
state for the beginning of the input
<bottom>

In step one, it will pop states until it gets to the 'beginning of the line' state, then it will shift to a state where the only possible action is to shift a newline, then it will discard lookahead until it gets to the newline (none in this case), then it will shift that newline.

If you instead use the rule line: balanced error NEWLINE it will work just as well, but will finish step 1 one state earlier, and when it reduces line: balanced error NEWLINE, that action will have access to $1 which will be the result of the correctly parsed balanced prefix (just <> in your example -- the intermediate ([]) was discarded along with the third state on the stack.


So now we can try additional rules to recover the intermediate state. One possibility is:

balanced: balanced '(' balanced error
        | balanced '[' balanced error
        | balanced '{' balanced error
        | balanced '<' balanced error

This will stop the popping of states immediately without popping anything (as the top state can handle an error), giving you a chance to access $1 and $3 in that state. But the problem is that it won't complete error recovery until after it has shifted a (non-error) token, so if there is a second error (as there is here) it will end up discarding the newline, which is probably not what you want.

At this point you can't do much more without mucking about with the internals of the error handling. You can have your actions look at and modify yychar (the lookahead token) and/or yyerrstatus, but the details of how to do that may vary with different versions of yacc or bison. One possibility is --yyerrstatus; in those 4 error actions -- that will exit the step 3/4 loop when it runs the action, so will avoid discarding more tokens. The net effect will be as if it is just inserting tokens (the missing )/]/}/>) to make things balanced.

balanced: balanced '(' balanced error { --yyerrstatus; printf("inserting missing ')'\n"); }
        | balanced '[' balanced error { --yyerrstatus; printf("inserting missing ']'\n"); }
        | balanced '{' balanced error { --yyerrstatus; printf("inserting missing '}'\n"); }
        | balanced '<' balanced error { --yyerrstatus; printf("inserting missing '>'\n"); }

along with these it might make sense to have

balanced: error ')' { printf("discarding unmatched ')'\n")' }
        | error ']' { printf("discarding unmatched ']'\n")' }
        | error '}' { printf("discarding unmatched '}'\n")' }
        | error '>' { printf("discarding unmatched '>'\n")' }

Though this might be undesirable as these rules end up taking precedence over the previous ones (due to have fewer symbols before the error they'll be higher on the stack). With the result that an input like

{(}

will result in

discarding unmatched '}'
inserting missing ')'
inserting missing '}'

You might be able to avoid that by refactoring the balanced rules into multiple versions based on whether there's an unbalanced open on the state stack or not (and what kinds there are), but that way lies madness.

Upvotes: 1

Related Questions