Reputation: 5795
This is a class project of sorts, and I've worked out 99% of all kinks, but now I'm stuck. The grammar is for MiniJava.
I have the following lex file which works as intended:
%{
#include "y.tab.h"
%}
delim [ \t\n]
ws {delim}+
comment ("/*".*"*/")|("//".*\n)
id [a-zA-Z]([a-zA-Z0-9_])*
int_literal [0-9]*
op ("&&"|"<"|"+"|"-"|"*")
class "class"
public "public"
static "static"
void "void"
main "main"
string "String"
extends "extends"
return "return"
boolean "boolean"
if "if"
new "new"
else "else"
while "while"
length "length"
int "int"
true "true"
false "false"
this "this"
println "System.out.println"
lbrace "{"
rbrace "}"
lbracket "["
rbracket "]"
semicolon ";"
lparen "("
rparen ")"
comma ","
equals "="
dot "."
exclamation "!"
%%
{ws} { /* Do nothing! */ }
{comment} { /* Do nothing! */ }
{println} { return PRINTLN; } /* Before {period} to give this pre
cedence */
{op} { return OP; }
{int_literal} { return INTEGER_LITERAL; }
{class} { return CLASS; }
{public} { return PUBLIC; }
{static} { return STATIC; }
{void} { return VOID; }
{main} { return MAIN; }
{string} { return STRING; }
{extends} { return EXTENDS; }
{return} { return RETURN; }
{boolean} { return BOOLEAN; }
{if} { return IF; }
{new} { return NEW; }
{else} { return ELSE; }
{while} { return WHILE; }
{length} { return LENGTH; }
{int} { return INT; }
{true} { return TRUE; }
{false} { return FALSE; }
{this} { return THIS; }
{lbrace} { return LBRACE; }
{rbrace} { return RBRACE; }
{lbracket} { return LBRACKET; }
{rbracket} { return RBRACKET; }
{semicolon} { return SEMICOLON; }
{lparen} { return LPAREN; }
{rparen} { return RPAREN; }
{comma} { return COMMA; }
{equals} { return EQUALS; }
{dot} { return DOT; }
{exclamation} { return EXCLAMATION; }
{id} { return ID; }
%%
int main(void) {
yyparse();
exit(0);
}
int yywrap(void) {
return 0;
}
int yyerror(void) {
printf("Parse error. Sorry bro.\n");
exit(1);
}
And the yacc file:
%token PRINTLN
%token INTEGER_LITERAL
%token OP
%token CLASS
%token PUBLIC
%token STATIC
%token VOID
%token MAIN
%token STRING
%token EXTENDS
%token RETURN
%token BOOLEAN
%token IF
%token NEW
%token ELSE
%token WHILE
%token LENGTH
%token INT
%token TRUE
%token FALSE
%token THIS
%token LBRACE
%token RBRACE
%token LBRACKET
%token RBRACKET
%token SEMICOLON
%token LPAREN
%token RPAREN
%token COMMA
%token EQUALS
%token DOT
%token EXCLAMATION
%token ID
%%
Program: MainClass ClassDeclList
MainClass: CLASS ID LBRACE PUBLIC STATIC VOID MAIN LPAREN STRING LB
RACKET RBRACKET ID RPAREN LBRACE Statement RBRACE RBRACE
ClassDeclList: ClassDecl ClassDeclList
|
ClassDecl: CLASS ID LBRACE VarDeclList MethodDeclList RBRACE
| CLASS ID EXTENDS ID LBRACE VarDeclList MethodDeclList RB
RACE
VarDeclList: VarDecl VarDeclList
|
VarDecl: Type ID SEMICOLON
MethodDeclList: MethodDecl MethodDeclList
|
MethodDecl: PUBLIC Type ID LPAREN FormalList RPAREN LBRACE VarDeclLi
st StatementList RETURN Exp SEMICOLON RBRACE
FormalList: Type ID FormalRestList
|
FormalRestList: FormalRest FormalRestList
|
FormalRest: COMMA Type ID
Type: INT LBRACKET RBRACKET
| BOOLEAN
| INT
| ID
StatementList: Statement StatementList
|
Statement: LBRACE StatementList RBRACE
| IF LPAREN Exp RPAREN Statement ELSE Statement
| WHILE LPAREN Exp RPAREN Statement
| PRINTLN LPAREN Exp RPAREN SEMICOLON
| ID EQUALS Exp SEMICOLON
| ID LBRACKET Exp RBRACKET EQUALS Exp SEMICOLON
Exp: Exp OP Exp
| Exp LBRACKET Exp RBRACKET
| Exp DOT LENGTH
| Exp DOT ID LPAREN ExpList RPAREN
| INTEGER_LITERAL
| TRUE
| FALSE
| ID
| THIS
| NEW INT LBRACKET Exp RBRACKET
| NEW ID LPAREN RPAREN
| EXCLAMATION Exp
| LPAREN Exp RPAREN
ExpList: Exp ExpRestList
|
ExpRestList: ExpRest ExpRestList
|
ExpRest: COMMA Exp
%%
The derivations that are not working are the following two:
Statement:
| ID EQUALS Exp SEMICOLON
| ID LBRACKET Exp RBRACKET EQUALS Exp SEMICOLON
If I only lex the file and get the token stream, the tokens match the pattern perfectly. Here's an example input and output:
num1 = id1;
num2[0] = id2;
gives:
ID
EQUALS
ID
SEMICOLON
ID
LBRACKET
INTEGER_LITERAL
RBRACKET
EQUALS
ID
SEMICOLON
What I don't understand is how this token stream matches the grammar exactly, and yet yyerror is being called. I've been trying to figure this out for hours, and I've finally given up. I'd appreciate any insight into what's causing the problem.
For a full example, you can run the following input through the parser:
class Minimal {
public static void main (String[] a) {
// Infinite loop
while (true) {
/* Completely useless // (embedded comment) stat
ements */
if ((!false && true)) {
if ((new Maximal().calculateValue(id1, i
d2) * 2) < 5) {
System.out.println(new int[11].l
ength < 10);
}
else { System.out.println(0); }
}
else { System.out.println(false); }
}
}
}
class Maximal {
public int calculateValue(int[] id1, int id2) {
int[] num1; int num2;
num1 = id1;
num2[0] = id2;
return (num1[0] * num2) - (num1[0] + num2);
}
}
It should parse correctly, but it is tripping up on num1 = id1;
and num2[0] = id2;
.
PS - I know that this is semantically-incorrect MiniJava, but syntactically, it should be fine :)
Upvotes: 0
Views: 235
Reputation: 241671
There is nothing wrong with your definitions of Statement
. The reason they trigger the error is that they start with ID
.
To start with, when bison processes your input, it reports:
minijava.y: conflicts: 8 shift/reduce
Shift/reduce conflicts are not always a problem, but you can't just ignore them. You need to know what causes them and whether the default behaviour will be correct or not. (The default behaviour is to prefer shift over reduce.)
Six of the shift/reduce conflicts come from the fact that:
Exp: Exp OP Exp
which is inherently ambiguous. You'll need to fix that by using actual operators instead of OP
and inserting precedence rules (or specific productions). That has nothing to do with the immediate problem, and since it doesn't (for now) matter whether the first Exp
or the second one gets priority, the default resolution will be fine.
The other ones come from the following production:
VarDeclList: VarDecl VarDeclList
| %empty
Here, VarDecl
might start with ID
(in the case of a classname used as a type).
VarDeclList
is being produced from MethodDecl
:
MethodDecl: ... VarDeclList StatementList ...
Now, let's say we're parsing the input; we've just parsed:
int num2;
and we're looking at the next token, which is num1
(from num1 = id1
). int num2;
is certainly a VarDecl
, so it will match VarDecl
in
VarDeclList: VarDecl VarDeclList
In this context, VarDeclList
could be empty, or it could start with another declaration. If it's empty, we need to reduce it right away (because we won't get another chance: non-terminals need to be reduced no later than when their right-hand sides are complete). If it's not empty, we can simply shift the first token. But we need to make that decision based on the current lookahead token, which is an ID
.
Unfortunately, that doesn't help us. Both VarDeclList
and StatementList
could start with ID
, so both reduce and shift are feasible. Consequently, bison
shifts.
Now, let's suppose that VarDeclList
used left-recursion instead of right-recursion. (Left recursion is almost always better in LR grammars.):
VarDeclList: VarDeclList VarDecl
| %empty
Now, when we reach the end of a VarDecl
, we have only one option: reduce the VarDeclList
. And then we'll be in the following state:
MethodDecl: ... VarDeclList · StatementList
VarDeclList: VarDeclList · VarDecl
Now, we see the ID
lookhead, and we don't know whether it starts a StatementList
or a VarDecl
. But it doesn't matter because we don't need to reduce either of those non-terminals; we can wait to see what comes next before committing to one or the other.
Note that there is a small semantic difference between left- and right-recursion in this case. Clearly, the syntax trees are different:
VDL VDL
/ \ / \
VDL Decl Decl VDL
/ \ / \
VDL Decl Decl VDL
| |
λ λ
However, in practice the most likely actions are going to be:
VarDeclList: %empty { $$ = newVarDeclList(); }
| VarDeclList VarDecl { $$ = $1; appendVarDecl($$, $2); }
which works just fine.
By the way:
1) While flex allows you to use definitions in order to simplify the regular expressions, it does not require you to use them, and nowhere is it written (to my knowledge) that it is best practice to use definitions. I use definitions sparingly, usually only when I'm going to write two regular expressions with the same component, or occasionally when the regular expression is really complicated and I want to break it down into pieces. However, there is absolutely no need to clutter your flex file with:
begin "begin"
...
%%
...
{begin} { return BEGIN; }
rather than the simpler and more readable
"begin" { return BEGIN; }
2) Along the same lines, bison
helpfully allows you to write single-character tokens as single-quoted literals: '('
. This has a number of advantages, starting with the fact that it provides a more readable view of the grammar. Also, you don't need to declare those tokens, or think up a good name for them. Moreover, since the value of the token is the character itself, your flex file can also be simplified. Instead of
"+" { return PLUS; }
"-" { return MINUS; }
"(" { return LPAREN; }
...
you can just write:
[-+*/(){}[\]!] { return yytext[0]; }
In fact, I usually recommend not even using that; just use a catch-all flex rule at the end:
. { return yytext[0]; }
That will pass all otherwise unmatched characters as single-character tokens to bison; if the token is not known to bison, it will issue a syntax error. So all the error-handling is centralized in bison, instead of being split between the two files, and you save a lot of typing (and whoever is reading your code saves a lot of reading.)
3) It's not necessary to put "System.out.println"
before "."
. They can never be confused, because they don't start with the same character. The only time order matters is if two patterns will maximally match the same string at the same point (which is why the ID
pattern needs to come after all the individual keywords).
Upvotes: 2