Bison read strings together

Question

I would like to make a project with Flex and Bison. I have a grammar (that's only a part of mine):

variable_name:
 | text {printf("VARIABLE NAME (TEXT) IN BISON: %s
", $1); $$ = _variable_init($1);}
 | character {printf("VARIABLE NAME (CHARACTER) IN BISON: %s
", $1); $$ = _variable_init($1);}
 ;

bool_expr:
 | true_t { printf("BOOL EXP TRUE
"); $$ = _bool_expression_init_bool(TRUE); }
 | false_t { printf("BOOL EXP FALSE
"); $$ = _bool_expression_init_bool(FALSE); }
 | variable_name { printf("BOOL EXP VARIABLE: %s
", $1->name); $$ = _bool_expression_init_variable($1); }
 | bool_expr eq bool_expr { printf("BOOL EXP ==
"); $$ = _bool_expression_init_binary_op($1, "==", $3); }
 | bool_expr noteq bool_expr { printf("BOOL EXP !=
"); $$ = _bool_expression_init_binary_op($1, "!=", $3); }
 | bool_expr and bool_expr { printf("BOOL EXP AND
"); $$ = _bool_expression_init_binary_op($1, "&&", $3); }
 | bool_expr or bool_expr { printf("BOOL EXP OR
"); $$ = _bool_expression_init_binary_op($1, "||", $3); }
 | '!' bool_expr { printf("BOOL EXP NOT
"); $$ = _bool_expression_init_unary_op("!", $2); }
 | '(' bool_expr ')' { printf("BOOL EXP ()
"); $$ = _bool_expression_init_with_brackets($2); }
 | int_expr eq int_expr { printf("BOOL EXP INT==
"); $$ = _bool_expression_init_from_int($1, "==", $3); }
 | int_expr noteq int_expr { printf("BOOL EXP INT!=
"); $$ = _bool_expression_init_from_int($1, "!=", $3); }
 | int_expr get int_expr { printf("BOOL EXP INT>=
"); $$ = _bool_expression_init_from_int($1, ">=", $3); }
 | int_expr let int_expr { printf("BOOL EXP INT<=
"); $$ = _bool_expression_init_from_int($1, "<=", $3); }
 | int_expr '<' int_expr { printf("BOOL EXP INT<
"); $$ = _bool_expression_init_from_int($1, "<", $3); }
 | int_expr '>' int_expr { printf("BOOL EXP INT>
"); $$ = _bool_expression_init_from_int($1, ">", $3); }
 ;

int_expr:
 | num { printf("INT EXP NUM
"); $$ = _int_expression_init_int($1); }
 | variable_name { printf("INT EXP VARIABLE
"); _int_expression_init_variable($1); }
 | int_expr '+' int_expr { printf("INT EXP +
"); $$ = _int_expression_init_binary_op($1, "+", $3); }
 | int_expr '-' int_expr { printf("INT EXP -
"); $$ = _int_expression_init_binary_op($1, "-", $3); }
 | int_expr '*' int_expr { printf("INT EXP *
"); $$ = _int_expression_init_binary_op($1, "*", $3); }
 | int_expr '/' int_expr { printf("INT EXP /
"); $$ = _int_expression_init_binary_op($1, "/", $3); }
 | '(' int_expr ')' { printf("INT EXP ())
"); $$ = _int_expression_init_bracket($2); }
 ;

I copied only the important parts (hopefully there is the issue). So when I want to parse this

var != 10

as a bool_expr the Bison identifies var as a variable and prints:

VARIABLE NAME (TEXT) IN BISON: var

but the in the next moment it prints

BOOL EXP VARIABLE: var !=

and it thinks that the variable is the "var !=" when there is a rule

int_expr != int_expr

but it doesn't check this part. Btw variable has Variable* type (struct), int_expr has IntExpression* (struct), bool_expr has BoolExpression* (struct). I don't what I should do. I tried and it worked when I write 6 additional rule to bool_expr that are almost same as the last 6. I replaced the first int_expr to variable, but it's disgusting. As I knew Bison search for the longest match not the first. The log:

VARIABLE NAME (TEXT) IN BISON: var
VARIABLE NAME IN FUNCTION: var //It was called in variable_init function
BOOL EXP VARIABLE: var!=
INT EXP NUM //10

rici · Accepted Answer

There is nothing in your grammar which allows the parser to distinguish between integer variables and boolean variables. Both are simply variable_name; moreover, either a boolean variable or an integer variable could be followed by a comparison operator. (In the case of var != 10, it would be possible to deduce that var had to be an integer variable, once it is seen that the right-hand side of the operator is an integer. But a != b would still be ambiguous, since the variable names do not carry a marker to indicate their type.)

Bison, by default, produces an LALR(1) parser, which means that every reduction needs to be determined by examining at most one token following the end of the production. (That's what the "(1)" in "LALR(1)" means.) In other words, the parser would have to be able to decide between reducing the variable_name "var" to a bool_expr or an int_expr no later than when it sees the != token. That's not possible, and because it is not possible, Bison should have reported a reduce/reduce conflict, which you are apparently ignoring. (If it didn't, then the rest of your grammar is relevant. But that would be surprising.)

Bison doesn't just give up when it sees a conflict. It makes a somewhat arbitrary choice (in the case of reduce/reduce conflicts, it chooses the reduction which occurs earliest in the grammar, in this case to bool_expr) and builds the parser regardless. Occasionally, this default produces the correct parse, but in most cases the resulting parser is flawed, with behaviour similar to what you are experiencing. So although Bison claims that the conflict report is just a warning, you ignore it at your peril.

As @user253751 notes in a comment, you can ask Bison to produce a GLR grammar, which allows arbitrary lookahead (at the cost of slowing down the parse). However, Bison's GLR implementation still requires the grammar to be unambiguous. Ambiguous parses will be detected at run-time, during the parse, and will cause the parse to fail; that would be the case with the vara != varb ambiguous expression noted above. (You can provide your own ambiguity resolution mechanism. But that's an extremely advanced technique, and in this case it won't work unless the ambiguity resolver has access to semantic information, like the declared type of each variable.)

Without seeing the rest of your grammar, it's hard to know whether type resolution could be done at compile-time (because variables need to be declared with a specific type), or only at run-time, but in neither case can that determination be made by the parser. So if you have boolean variables (and even if you don't), you cannot do better in the parser than to just have one expression non-terminal.

If you require declaration before use, then you could do type-resolution in your reduction actions by consulting the symbol table you are building up. At that point, you can either insert an automatic conversion function, or report an error (depending on whether you feel that automatic conversion is convenient). If you only require declaration (not prior declaration), then you can do the type-resolution after the parse is complete, by walking the AST twice: first to build up the symbol table, and second to resolve types.

If you consider type mismatches to be an error, then reporting the error in a semantic action is a lot more user-friendly. During the parse, it is very difficult to produce an error message more informative than "syntax error at line 10". But in the semantic action, you know precisely what the error is and what produced it, so it's easy to produce error messages like "the comparison at line 10 requires that 'var' be an integer variable, not a boolean variable." Your users will thank you.

By the way, the usual convention is to use UPPER_CASE for the symbolic names of terminals, like !=, which would usually be named something like NOTEQ or T_EQ. But if you are using Bison, you can make your grammar a lot more readable by using quoted aliases for your terminals:

%token EQ "==" NOTEQ "!="
       GEQ ">=" LEQ "<="

Then you can use the symbolic names in your lexical analyser:

"!="   { return NOTEQ; }

without forcing whoever reads your grammar to guess what the name means:

expr: expr "==" expr
    | expr "!=" expr
    | ...

You must use double quotes; '+' is quite different.

Bison read strings together

Answers (1)

Related Questions