Kaz

Reputation: 58500

Berkeley Yacc versus GNU Bison: different tolerance w.r.t. trailing tokens

I'm running into a parser portability problem: different behavior when translated using byacc versus bison --yacc.

The Bison-generated parser allows me to call yyparse to extract just some prefix of the input token sequence which can be derived from the start symbol by the grammar rules. This is because Bison's generated parser has a $default reduce action. This, I think, means that "for any lookahead tokens which are not mentioned by other actions (because they don't match the grammar), perform this reduce".
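
For illustration, a state in a Bison --report=state listing (the y.output file) typically ends with something like the following; this is a sketch with invented rule numbers and symbols, not output from my grammar:

State 8

    3 expr: expr '+' expr .

    $default  reduce using rule 3 (expr)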

By contrast, for the same state and set of rules, Berkeley Yacc does not have such a default reduce action. The same reduction is keyed to matching the $end symbol specifically. In other words, in the Berkeley-Yacc-generated parser, the rule is effectively "anchored" to the end, like a regex with a $ on it.

(Note: the Bison-generated parser still anchors the grammar as a whole to the end, but it does this by matching on $end only in the top-level rule for the start symbol; it does not proliferate $end matching into the subordinate rules!)

The difference matters because I call yyparse more than once to extract successive phrasal units. This works with Bison, because I YYACCEPT before it has a chance to reduce to the start symbol where the implicit $end token is required. (Well, it doesn't exactly work "out of the box", but with a certain trick it can be made to work.) With Berkeley Yacc, the syntax error due to not seeing the $end in the subordinate rules which derive the start symbol means that the whole scheme is dead in the water.
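
(The calling pattern is roughly the following sketch; consume() and parse_result are invented names for whatever mechanism hands the extracted unit back from the YYACCEPT-ing action:)

/* Hypothetical driver: each yyparse call extracts one phrasal unit. */
for (;;) {
    if (yyparse() != 0)     /* stop on syntax error or end of input */
        break;
    consume(parse_result);
}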

Is there some way to get Berkeley Yacc to do a default reduce for lookahead token values which don't match the grammar-defined continuation or termination of the syntax? In other words, to populate all the unused entries in the LALR(1) table for that state with default reduces?

Idea: I'm thinking that perhaps the phrasal unit being extracted can be wrapped in a repetition rule. Instead of trying to parse out a single expr, I can parse a "sequence of expr" via such a newly introduced shim rule, but then immediately YYACCEPT after getting just one, along the lines of:

start : exprs { $$ = $1; };

exprs : expr { $$ = $1; YYACCEPT; /* Got just the one expr I really wanted! */ } 
      | exprs expr { /* Never reached! Byacc fooled! */ }
      ;

I will try it this way, but it would still be good to know why there is such a gaping difference between two Yacc implementations, and how to overcome it directly.


Edit: a hack which works in my project is along the lines of this pseudo-code:

start : expr { YYACCEPT; } byacc_fool { /*notreached*/ abort(); }

byacc_fool : expr { abort(); }
           | /*nothing*/ { abort(); }
           ;

All regression tests pass with Byacc or Bison.

None of the dummy actions are reached; they serve the purpose of creating a grammar rule which allows expr to be followed by another expr. But this is achieved in such a way that it won't actually consume the second expr; at the point of the YYACCEPT invocation, just one lookahead token is consumed from any following expr. (I have a solution in place to restore that token before each successive yyparse call; and that hack now works under Byacc.)
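
For reference, the token-restoring part of such a scheme might look roughly like this. It is only a sketch: saved_token, saved_lval and real_yylex are invented names, and it assumes a non-reentrant parser in which yychar and yylval are visible to actions (true of classic Yacc-style parsers, but not guaranteed by any standard):

/* One-slot pushback buffer consulted before the real scanner. */
static int saved_token = -1;        /* -1 means the slot is empty */
static YYSTYPE saved_lval;

int yylex(void)
{
    if (saved_token >= 0) {
        int tok = saved_token;
        saved_token = -1;
        yylval = saved_lval;
        return tok;
    }
    return real_yylex();            /* the ordinary scanner */
}

/* In the YYACCEPT-ing action, stash the lookahead, if one was read:
 *
 *     if (yychar > 0) { saved_token = yychar; saved_lval = yylval; }
 */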

I still have a feeling like there is something simple I'm not seeing that everyone else knows.

Upvotes: 2

Views: 1308

Answers (1)

akim

Reputation: 8759

FWIW, I think the analysis is slightly incorrect, and misleading about $default.

$default is a compression mechanism: instead of a long list of lookaheads for which the reduction is the same, the parser performs this reduction for all the remaining lookaheads, including those that were not valid. This does not allow the parser to accept more sentences; it can be proved that the error will be caught in some other rule. Nor does it allow the parser to accept a longer prefix: the error will be caught at exactly the same point, although a few more reductions may be performed first. In exchange, your tables are smaller, since you have fewer cases to encode. Today it might be a useless optimization, but back then space was really critical.
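
To make the compression concrete: where an uncompressed table row would spell out each lookahead separately, e.g. (illustrative symbols only):

')'     reduce using rule 4 (term)
'+'     reduce using rule 4 (term)
'*'     reduce using rule 4 (term)
$end    reduce using rule 4 (term)

the compressed row is just:

$default  reduce using rule 4 (term)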

I believe that both Bison and Byacc implement this technique (indeed, Bison and Byacc are forks of the same program and still share quite a lot of code).

However, at some point in the history of Bison, the initial rule was added: $accept: exp $end, where exp denotes the user's start symbol. Before that, the handling of this rule was hardcoded in Bison (the generator) and in the generated parsers, and even the tables were hand-tuned to deal with exp and $end without using this "rule 0". I changed that because (i) the theory of LR parsers does use this rule, so it was a problem when teaching the theory to have Bison display different tables, and (ii) the code became much simpler and shorter with this rule 0.
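
In a Bison report, this rule now appears at the top of the grammar listing, along these lines (a sketch with an invented grammar):

Grammar

    0 $accept: exp $end

    1 exp: exp '+' exp
    2    | NUM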

As an unexpected consequence (that change was made many, many years ago, and this is the first time I hear that it is observable), the $end token is now treated like any other token, and is subject to being fused into a $default action. This, I guess, is where you can tell the difference between Byacc and Bison.

That being said, I don't have a workaround to propose other than yours :) Except, if you are willing to move to Bison only (which obviously is not the case), switching to a push parser. Push parsers were designed exactly for your needs: interactive loops rather than batch parsing.
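
For the record, a minimal push-parser loop looks roughly like this (a sketch assuming %define api.push-pull push in the grammar file and a reentrant scanner; error handling omitted):

/* Driver: feed tokens to the parser one at a time. */
yypstate *ps = yypstate_new();
int status;
do {
    YYSTYPE lval;
    int token = yylex(&lval);            /* hypothetical reentrant scanner */
    status = yypush_parse(ps, token, &lval);
} while (status == YYPUSH_MORE);
yypstate_delete(ps);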

Upvotes: 0
