Yacc/Lexer: how does this work, really?

Question

I have tried to sit here and read tutorials on how YACC works with the lex file however I'm not sure I can wrap my head around it. I understand it is for reading the actual input file and determining if functions such as add or subtract are in the right format, but how does this work? I understand how Lex files work and have constructed one that returns defined variables based on what it comes across in the file.

If I have a program that is very simple, how would I go about analyzing the first lines of this test programming language. Let's assume "program" is a defined value in lex, as well as "is", "var", "begin", "+", "print", ";", ",", and "end".

How would the yacc file need to be written in order for the first few lines to be read?

Test file:

program xyz is 
var a, b, c
begin
  a = 2;
  b = 3;
  c = a + b;
  print c
end

Yaccer.y

%token EOFNUM       0
%token SEMINUM      1
%token LPARENNUM    2
%token RPARENNUM    3
%token ICONSTNUM    4
%token BEGINNUM     5
%token PROGRAMNUM   6
%token MINUSNUM     7
%token TIMESNUM     8
%token VARNUM       9
%token COMMANUM     10
%token IDNUM        11
%token ENDNUM       12
%token ISNUM        13
%token PLUSNUM      14
%token DIVNUM       15
%token PRINTNUM     16
%token EQUALNUM     17

%left '+' '-'
%left '*' '/'

%%


%%

#include "lex.yy.c"
#include 

yyerror(str)
char *str;
{   printf("yyerror: %s at line %d
", str, yyline); }

main () {
    if (!yyparse()) { printf("accept
");}
    else { printf("reject
"); }
}

I know there needs to be some %types defined for the equations, however I am not sure how to go about using the types along with the %left declarations, or even if this is correct. I am also confused as to how I analyze the first line.

Dervall · Accepted Answer

You would need to create both a lex file and a yacc file. In the lex file you define the individual tokens for every type of token than your grammar contain. Then you construct the grammar in BNF form suitable for yacc. For your simple input it looks a bit like this. I assume that it's a typo that the print statement doesn't have a ; after it, and this grammar requires a semicolon.

%token IDENTIFIER
%token PROGRAM
%token BEGIN
%token END
%token IS
%token VAR
%token PRINT
%token NUMBER

program
statementlist
statement
printstatement
assignstatement 
expression

%%

program : PROGRAM IDENTIFIER IS VAR variables BEGIN statementlist END;

variables : IDENTIFIER | variables ',' IDENTIFIER;

statementlist : statement | statementlist statement;

statement : assignstatement | printstatement;

printstatement : PRINT IDENTIFIER ';';

assignstatement : IDENTIFIER '=' expression ';';

expression : value | expression '+' value;

value : NUMBER | IDENTIFIER;

%%

The %left and %right are associativity modifiers, and aren't really needed in your case. They would be needed if you were to support - using the same precedence as +. Well, technically not really needed but they make the grammar easier to understand.

Yacc isn't easy to understand, and I recommend a tutorial on the subject. Rules very quickly become recursive in nature and you have to have a certain mindset when working with it. It's easier to build your grammar incrementally, starting with the highest order statement. Make a parser that just recognizes program begin end and then work in the features as you go.

A good tool to test and verify grammars is the javascript parser generator found here

Yacc/Lexer: how does this work, really?

Answers (1)

Related Questions