flogram_dev
flogram_dev

Reputation: 42858

Bison subscript expression unexpected error

With the following grammar:

program: /*empty*/ | stmt program;
stmt: var_decl | assignment;
var_decl: type ID '=' expr ';';
assignment: expr '=' expr ';';
type: ID | ID '[' NUMBER ']';
expr: ID | NUMBER | subscript_expr;
subscript_expr: expr '[' expr ']';

I'd expect the following to be valid:

array[5] = 0;

That's just an assignment with a subscript_expr on the left-hand-side. However the generated parser gives an error for that statement:

syntax error, unexpected '=', expecting ID

Generating the parser also warns that there's 1 shift/reduce conflict. Removing subscript_expr makes it go away.

Why does this happen and how can I get it to parse array[5] = 0; as an assignment with a subscript_expr?

I'm using Bison 2.3.

Upvotes: 1

Views: 145

Answers (2)

rici
rici

Reputation: 241911

The following two statements are both valid in your language:

x [ 3 ] = 42;

x [ 3 ] y = 42;

The first is an assignment of an element of the array variable x, while the second is a declaration and initialization of the array variable y whose elements are of type x.

But from the parser's viewpoint, x and y are both just IDs; it has no way of knowing that x is a variable in the first case and a type in the second case. All it can do is notice that the two statements match the productions assignment and var_decl, respectively.

Unfortunately, it cannot do that until it sees the token after the ]. If that token is an ID, then the statement must be a var_decl; otherwise, it's an assignment (assuming the statement is valid, of course).

But in order to parse the statement as an assignment, the parser must be able to produce

expr '=' expr

which in this case is the result of expr: subsciprt_expr, which in turn is subscript_expr: expr[expr]`.

So the set of reductions for the first statement will be as follows: (Note: I didn't write the shifts; rather, I mark the progress of the parse by putting a • at the end of each reduction. To get to the next step, just shift the • until you reach the end of the handle.)

ID • [ NUMBER ] = NUMBER ;             expr: ID
expr [ NUMBER • ] = NUMBER ;           expr: NUMBER
expr [ expr ] • = NUMBER ;             subscript_expr: expr '[' expr ']'
subscript_expr • = NUMBER ;            expr: subscript_expr
expr = NUMBER • ;                      expr: NUMBER
expr = expr ; •                        assignment: expr '=' expr ';'
assignment

The second statement must be parsed as follows:

ID [ NUMBER ] • ID = NUMBER ;          type: ID '[' NUMBER ']'
type ID = NUMBER • ;                   expr: NUMBER
type ID = expr ; •                     var_decl: type ID '=' expr ';'
var_decl

That's a shift/reduce conflict, because the crucial decision must be made immediately after the first ID. In the first statement, we need to reduce the identifier to an expr. In the second statement, we must continue shifting until we are ready to reduce a type.

Of course, this problem wouldn't exist if we could lexically distinguish type IDs from variable name IDs, but that may not be possible (or, if possible, it may not be desirable because it requires feedback from the parser to the lexer).

As written, the shift/reduce prediction can be made with fixed lookahead, since the fourth token after the ID will determine the possibilities. That makes the grammar LALR(4), but that doesn't help much since bison only implements LALR(1) parsers. In any case, it is likely that a less simplified grammar will not be fixed-lookahead, for example if constant expressions are allowed for array sizes, or if arrays can have multiple dimensions.

Even so, the grammar is not ambiguous, so it is possible to use a GLR parser to parse it. Bison does implement GLR parsers; it is only necessary to insert

%glr-parser

into the prologue. (The shift/reduce warning will still be produced, but the parser will correctly identify both kinds of statement.)

It's worth noting that C doesn't have this particular parsing problem precisely because it puts the array size after the name of the variable being declared. I don't believe this was done to avoid parsing problems (although who knows?) but rather because it was believed that it is more natural to write declarations the way variables are used. Hence, we write int a[3] and char *p, because in the program we will dereference using a[i] and *p.

It is possible to write an LALR(1) grammar for this syntax, but it's a bit annoying. The key is to delay the reduction of the syntax ID [ NUMBER ] until we know for sure which production it will be the start of. That means we need to include the production expr: ID '[' NUMBER ']'. That will result in a larger number of shift/reduce warnings (since it makes the grammar ambiguous), but since bison always prefers to shift, it should produce a correct parser.

Upvotes: 2

flogram_dev
flogram_dev

Reputation: 42858

Adding %glr-parser solves this.

Upvotes: 1

Related Questions