Kulwant Singh
Kulwant Singh

Reputation: 131

Implementation of grammar using Bison. Grammar control flow unexpected

Abstract grammar. Actual grammar follows:

start ->  t1  V1;

V1 -->   t2   V1
       | t3   V2
       ;

V2 -->   t4
       | /* Empty */
       ;

When the control is in V2 and token t3 is encountered. The control goes back to V1. All right to this point.

But the control doesn't stop at V1 to trigger t3, but goes back to start, this is the problem.

The actual grammar. Explanation follows.

%{
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

void yyerror (char *s); 
extern FILE *yyin;

%}

%token MAIN   

%token INT_DT INT_LIT   /* t_INT_DT --> Integer Data type, i.e. "int"
                         * t_INT_LIT --> An integer, i.e. "5".
                         */

%token VAR              // A variable

%token SCANF  PRINTF    // For the printf and the scanf statements.

%token IF     ELSE      // For the if and else statements.
%token GOTO

%token OPRTR            // For the operator in an expression.

/* Begining of a basic block, BBS, Basic Block Statement.
 * BBS_ST = "<bb "
 * BBS_END1 ">:"
 * BBS_END2   ">;"
 * Example:   
 * <bb 2>:
 *      Statements;
 *      goto <bb 3>;
 */
%token BBS_ST   // "<bb "// it is "<bb ".
%token BBS_END1 // ">:" 
%token BBS_END2 // ">;" 

// Opening and closing Braces, i.e. { & }
%token OP_BR  // "{" 
%token CL_BR  // "}"

// For opening and closing Parantheses, ( & )
%token OP_PR  //"("  
%token CL_PR  //")"

%token EQUAL  // "="
%token COMMA  // ","       
%token NBSP   // "&"
%token SM_CLN // ";"       // For ";" 

%token RETURN           // For the return statement.

%start begin

%%

begin:   MAIN    main   ;

main:    OP_BR    functn   CL_BR   ;

functn:   INT_DT   VAR      SM_CLN   functn  /*Recursion to stay in functn */
        | BBS_ST   INT_LIT  BBS_END1  bb     
        ;

bb:    SCANF    var_List   CL_PR    SM_CLN  bb
    |  PRINTF   var_List   CL_PR    SM_CLN  bb
    |  VAR      EQUAL      expr     SM_CLN  bb
    |  IF       OP_PR      expr     CL_PR   bb
    |  ELSE     bb
    |  GOTO     BBS_ST     INT_LIT  BBS_END2
    |  RETURN   SM_CLN    
    |  /* Empty Rule. */  
    ;  /* bb has can't catch BBS_ST as first token, thus it must return
        * to the calling rule and functn can catch BBS_ST. */

var_List:    COMMA   VAR   var_List 
          |  /* Empty Rule. */ 
          ;

expr:     VAR      expr2 
      |   INT_LIT  expr2  
      ;

expr2:    OPRTR    expr3  
       |  /* Empty Rule */
       ;

expr3:    VAR     
       |  INT_LIT  
       ;

%%
int main (int argc, char *argv[])
{
    #if YYDEBUG == 1
        extern int yydebug;
        yydebug = 1;
    #endif

    if (argc == 2)
    {
        yyin = fopen (argv [1], "r");
        if (yyin == NULL)
            perror (argv [1]); 
    }

    yyparse ();

    fclose (yyin);
    return 0;    
}


void yyerror (char *s)
{
    printf ("Error: %s\n", s);
}

Implementation of ideas suggested in answers till now, but this doesn't help.

begin:  MAIN main ; 

main:   OP_BR   functn_Decls   CL_BR 
        ; 

functn_Decls:   functn_Decls INT_DT VAR SM_CLN 
              | functn 
              ; 

functn:    functn BBS_ST INT_LIT BBS_END1 
        |  bb 
        ; 

bb:    bb   stmnt 
    |  /* Empty Rule */ 
    ; 

stmnt:   t1  // terminals only. 
       ; 

var_List:   t2 // terminal just for illustration. 
            ;

expr:  t3 // terminal just for illustration.
       ;

Error causing input:

main ()
{
  <bb 2>:
  goto <bb 4>;

  <bb 3>:
}

Error:

At <bb the token BBS_ST is triggered. At this time the control is in bb, like bb --> Waiting for a token Since no rule in bb starts with BBS_ST, it returns to the calling rule.

Now the calling rule is functn, which has a rule starting with BBS_ST. Problem is that, this rule is not invoke.

Instead the control reaches the parent of functn at P1. If at P1 I add BBS_ST, It catches this token.

Please point out why is it happening like this?

Corresponding Flex file:

%{
    // Include the file
    #include "Parse_One_Line2.tab.h"
    extern YYSTYPE  yylval;
    extern char    *yytext;
%}

LB   ^[ \t]*
CH   [a-zA-Z]
DI   [0-9]

%%

{LB}main[ ]*\([.]*\)    return MAIN; // the main() call.

{LB}int[ ]              return INT_DT;

[+-]?{CH}({CH}{DI}_)*   return VAR; 

[+-]?[0-9]+             return INT_LIT;  

{LB}scanf[ ]?\(\".*\"   return SCANF;

{LB}printf[ ]?\(\".*\"  return PRINTF;

{LB}if[ ]               return IF; // If statement.

{LB}else[ ]             return ELSE; // If statement.

{LB}goto[ ]             return GOTO;

{LB}return              return RETURN;

\{CLOBBER\};            ;// Ignore it.

"<bb "                  return BBS_ST;  // next integer is a BB Index.

">:"                    return BBS_END1; // token Greater than Colon.

">;"                    return BBS_END2;

^;;[^\n]*[\n]           ; // Ignore it, comment.

"+"|"-"|"*"|"/"|"%"|"=="|">="|"<="|"<"|">"|"!=" return OPRTR; 

^[{]                    return OP_BR; // It is "{".

^[}]                    return CL_BR; // It is "}".

"("                     return OP_PR; // Paranthesis Open

")"                     return CL_PR; // Paranthesis Close


";"                     return SM_CLN;

"="                     return EQUAL;

"&"                     return NBSP;

","                     return COMMA;

[\n]                    ; /* Skip these characters. */

.                       ; /* Skip.   */

%%


int yywrap (void)
{
    return 1;
}

The Output file using YYDEBUG flag.

Starting parse
Entering state 0
Reading a token: Next token is token MAIN ()
Shifting token MAIN ()
Entering state 1
Reading a token: Next token is token OP_BR ()
Shifting token OP_BR ()
Entering state 3
Reading a token: Next token is token INT_DT ()
Shifting token INT_DT ()
Entering state 6
Reading a token: Next token is token VAR ()
Shifting token VAR ()
Entering state 9
Reading a token: Next token is token SM_CLN ()
Shifting token SM_CLN ()
Entering state 12
Reading a token: Next token is token BBS_ST ()
Shifting token BBS_ST ()
Entering state 7
Reading a token: Next token is token INT_LIT ()
Shifting token INT_LIT ()
Entering state 10
Reading a token: Next token is token BBS_END1 ()
Shifting token BBS_END1 ()
Entering state 13
Reading a token: Next token is token GOTO ()
Shifting token GOTO ()
Entering state 20
Reading a token: Next token is token BBS_ST ()
Shifting token BBS_ST ()
Entering state 29
Reading a token: Next token is token INT_LIT ()
Shifting token INT_LIT ()
Entering state 38
Reading a token: Next token is token BBS_END2 ()
Shifting token BBS_END2 ()
Entering state 47
Reducing stack by rule 10 (line 76):
   $1 = token GOTO ()
   $2 = token BBS_ST ()
   $3 = token INT_LIT ()
   $4 = token BBS_END2 ()
-> $$ = nterm bb ()
Stack now 0 1 3 6 9 12 7 10 13
Entering state 22
Reducing stack by rule 4 (line 68):
   $1 = token BBS_ST ()
   $2 = token INT_LIT ()
   $3 = token BBS_END1 ()
   $4 = nterm bb ()
-> $$ = nterm functn ()
Stack now 0 1 3 6 9 12
Entering state 14
Reducing stack by rule 3 (line 67):
   $1 = token INT_DT ()
   $2 = token VAR ()
   $3 = token SM_CLN ()
   $4 = nterm functn ()
-> $$ = nterm functn ()
Stack now 0 1 3
Entering state 8
Reading a token: Next token is token BBS_ST ()
Error: popping nterm functn ()
Stack now 0 1 3
Error: popping token OP_BR ()
Stack now 0 1
Error: popping token MAIN ()
Stack now 0
Cleanup: discarding lookahead token BBS_ST ()
Stack now 0

Upvotes: 0

Views: 311

Answers (2)

Chris Hunt
Chris Hunt

Reputation: 4030

I reorganized a few of your rules. The problem is that when functn sees a bb, it expects only one. As a result when your GOTO is reached no further tokens are expected. Allowing functns to appear after a bb statement should fix this and give you the behavior you're looking for.

functn: INT_DT VAR ";" functn
      | bb_stmt functn
      | bb_stmt
      ;

bb_stmt: BBS_ST /*P2*/ INT_LIT ">:" bb
       ;

If you want to continue to force statements to be only at the top of the function, you can do something like this in addition to the above:

main:    "{"     functn_decls /*P1*/   "}"   ;

functn_decls: INT_DT VAR ";" functn_decls
            | functn
            ;

functn: bb_stmt functn
      | bb_stmt
      ;

Upvotes: 1

rici
rici

Reputation: 241671

Your grammar does not do what you think it does. (Here, I'm just using the reduced grammar because it's simpler than wading through the entire grammar and, as indicated, the principle is the same.)

start →  t1  V1

V1    →  t2   V1
      |  t3   V2

V2    →  t4
      |  /* Empty */

Let's ask the simple question: What can follow V2? Since the only place where V2 is used in the grammar is in the production V1 → t3 V2, the answer can only be: Exactly the same tokens as could follow V1.

So what can follow V1? Again, there are only two places where V1 is used, and both are at the end of productions. One is recursive, so it doesn't contribute to the follow set, and the other one is in start → t1 V1. So we can conclude that the only token which can follow V1 (and hence the only token which can follow V2) is the End-of-Input token, conventionally written $.

So t3 cannot follow V1 or V2, and as a result, the sentence:

t1 t3 t3

is not part of the language; the second t3 cannot be parsed.


On a more general note, you seem to be trying to analyze the behaviour of the generated parser as though it were a top-down recursive-descent parser. Bison does not produce top-down parsers; it generates bottom-up LALR(1) parsers (by default), and your concept of "control flow" does not match anything which is going on in the LR(k) algorithm.

Also, LR(1) grammars do not have any problem with left-recursion, so it is unnecessary to mutilate the grammar as you would need to do for a top-down parser generator. You will find that left-recursion works much better, because it actually reflects the structure of the language. If you analyze

statement1 ; statement2 ; statement3 ;

using a right-recursive grammar:

Program → ε | Statement ; Program

you will find that the statement reductions are applied right-to-left, which is probably backwards from your perspective. That's because an LR parser produces the rightmost derivation, so it will not start doing reductions until it reaches the end of the input:

  statement1 ; statement2 ; statement3 ;
→ statement1 ; statement2 ; statement3 ; Program (Program → ε)
→ statement1 ; statement2 ; Program              (Program → Statement ; Program)
→ statement1 ; Program                           (Program → Statement ; Program)
→ Program                                        (Program → Statement ; Program)

It is most likely that what you really wanted was more like the following (going back to something like your reduced grammar):

S  → t1 | S V1 ';'
V1 → t2 | t3 V2
V2 → t4 | /* Empty */

which makes S a t1 followed by an arbitrary number (possibly 0) of either t2, t3 or t3 t4, each followed by a semicolon.

Upvotes: 1

Related Questions