Issue with multiple line comment grammar - LEX/YACC

I'm trying to discard comments in lex so that they never reach yacc. For some reason, when the input contains a multi-line comment, my parser prints the comment text when it is supposed to print nothing.

Here is the issue I run into:

I'm honestly not sure what is going on. Nothing is supposed to print out here. Why does it print the comment? Is my grammar wrong in my lex file?

Here is my lex file:

%{
/*constants are defined outside of the l file in y.tab.h
*constants are defined from 257
*/

#include "y.tab.h"
int input_line_no = 1;
char the_tokens[1000];
char full_line[1000];
int lex_state = 0;

%}

whitespace         [ \t]
number             [0-9]
letter             [A-Za-z]
alfanum            [A-Za-z0-9_]
intcon             {number}+
id                 {letter}{alfanum}*
anything           .

%option noyywrap
 /*
 *Start conditions are specified to identify comments, 
 *literal strings, and literal chars. 
 */

%Start  string_in char_in 

%x COMMENT
%%

 /*identify comment*/
^[\t]*"{".*"}"[\t]*\n ;
^[\t]*"/*" {lex_state = 1; BEGIN COMMENT;}
^[\t]*"/*".*"*/"[\t]*\n ;

<COMMENT>"*/"[\t]*\n {lex_state=0; BEGIN 0;}
<COMMENT>"*/" {lex_state=0; BEGIN 0;}
<COMMENT>\n ;
<COMMENT>.\n ;

 /*tokenization of special strings*/
"extern"        {return EXTERN;}
"if"            {return IF;}
"else"          {return ELSE;}
"void"          {return VOID;}
"char"          {return CHAR;}
"int"           {return INT;}


 /*line number is recorded*/
[\n]                        input_line_no++;


 /*start tokenization of strings*/
<INITIAL>\"             {
                lex_state = 2;
                                BEGIN(string_in);

                        }
<string_in>[^"]     {
                return(STRINGCON);
            }
<string_in>\"       {
                lex_state = 0;
                BEGIN(INITIAL);
            }
 /*tokenization of characters*/
<INITIAL>\' {
            lex_state = 3;
            BEGIN(char_in);
        }
<char_in>[^']   {
            return(CHARCON);
        }
<char_in>\\n    {
            return(CHARCON);
        }
<char_in>\\0    {
            return(CHARCON);
        }
<char_in>\' {
            lex_state = 0;
            BEGIN(INITIAL);
        }

{whitespace}    ;

 /*tokenization of numbers*/
{intcon}         {return(INTCON);}
{id}        {return ID;}

 /*tokenization of operations*/
"=="        {return EQUALS;}
"!="        {return NOTEQU;}
">="        {return GREEQU;}
"<="        {return LESEQU;}
">"     {return GREATE;}
"<"     {return LESSTH;}

"&&"        {return ANDCOM;}
"||"        {return ORCOMP;}
"!"             {return ABANG;}

";"     {return SEMIC;}
","     {return COMMA;}
"("     {return LPAR;}
")"     {return RPAR;}      
"["     {return LBRAC;}
"]"     {return RBRAC;}
"{"     {return LCURL;}
"}"     {return RCURL;}

"+"     {return ADD;}
"-"     {return SUB;}
"*"     {return MUL;}
"/"     {return DIV;}
"="     {return EQUAL;}

 /*For strings that can not be identified by any patterns specified previously
 *lex returns the value of the character
 */

{anything}     {return(OTHER);}

%%

Here is my yacc file:

%{

#include <stdio.h>
#define YDEBUG
#ifndef YDEBUG

#define Y_DEBUG_PRINT(x)

#else

#define Y_DEBUG_PRINT(x) printf("Yout %s \n ",x)

#endif
int yydebug = 0; 

extern char the_token[]; 
 /* This is how I read tokens from lex... :) */
extern int input_line_no; 
 /* This is the current line number */
extern char *full_line; 
 /* This is the full line */
extern int lex_state;


%}

%token STRINGCON CHARCON INTCON EQUALS NOTEQU GREEQU LESEQU GREATE LESSTH
%token ANDCOM ORCOMP SEMIC COMMA LPAR RPAR LBRAC RBRAC LCURL RCURL ABANG
%token EQUAL ADD SUB MUL DIV ID EXTERN FOR WHILE RETURN IF ELSE 
%token VOID CHAR INT OTHER

%left ORCOMP
%left ANDCOM
%left EQUALS NOTEQU
%left LESSTH GREATE LESEQU GREEQU
%left ADD SUB
%left MUL DIV
%right UMINUS
%right ABANG

%start prog
%%

prog:
| dcl SEMIC prog2
| Function prog2 

prog2:
| dcl SEMIC prog2 
| Function  prog2 

dcl: VAR_list 
| ID LPAR Param_types RPAR dcl2 
| EXTERN ID LPAR Param_types RPAR dcl2 
| EXTERN Type ID LPAR Param_types RPAR dcl2 
| EXTERN VOID ID LPAR Param_types RPAR dcl2 
| Type ID LPAR Param_types RPAR dcl2 
| VOID ID LPAR Param_types RPAR dcl2 

dcl2: 
| COMMA ID LPAR Param_types RPAR dcl2 

Function: Functionhead LCURL Functionbody RCURL 
| VOID Functionhead LCURL Functionbody RCURL 
| Type Functionhead LCURL Functionbody RCURL 

Functionhead: ID LPAR Param_types RPAR 

Functionbody: 
|VAR_list STMT_list 

Param_types: VOID 
|Param_types1 

Param_types1: Param_type1 
| Param_types1 COMMA Param_type1 

Param_type1: Type ID Param_type11 

Param_type11: 
| LBRAC RBRAC 

VAR_list: Type VAR_list2 

VAR_list2: var_decl 
| var_decl COMMA VAR_list2 

var_decl: ID 
| ID LBRAC INTCON RBRAC 

Type: CHAR 
|INT

STMT_list: STMT2 

STMT2: STMT 
| STMT STMT2 

STMT : IF LPAR Expr RPAR STMT 
| IF LPAR Expr RPAR STMT ELSE STMT
 /*if cats) ERROR*/
| IF Expr RPAR STMT ELSE STMT {warn("STMT-IF: missing LPAR");}
 /*if (cats ERROR*/
| IF LPAR Expr STMT ELSE STMT {warn("STMT-IF: missing RPAR");}
 /*two elses ERROR*/
| IF LPAR Expr STMT ELSE ELSE STMT {warn(":too many elses");}
| WHILE LPAR Expr RPAR STMT
 /*for(c=0;c<1;c++)*/
| FOR LPAR Assign SEMIC Expr SEMIC Assign RPAR STMT 
 /*for(;c<1;c++)*/
| FOR LPAR SEMIC Expr SEMIC Assign RPAR STMT 
 /*for(;;c++)*/
| FOR LPAR SEMIC SEMIC Assign RPAR STMT 
 /*for(;;)*/
| FOR LPAR SEMIC SEMIC RPAR STMT 
 /*for(c=0;;)*/
| FOR LPAR Assign SEMIC SEMIC RPAR STMT 
 /*for(c=0;c<1;)*/
| FOR LPAR Assign SEMIC Expr SEMIC RPAR STMT 
 /*for(c=0;;c++)*/
| FOR LPAR Assign SEMIC SEMIC Assign RPAR STMT 
 /*for(;c<1;)*/
| FOR LPAR SEMIC Expr SEMIC RPAR STMT 
 /*for() ERROR*/
| FOR LPAR RPAR STMT {warn("STMT-FOR: empty statement");}
 /*for{;;;) ERROR*/
| FOR LPAR SEMIC SEMIC SEMIC RPAR {warn("STMT-FOR: too many semicolons");}
 /*for;;) ERROR*/
| FOR SEMIC SEMIC RPAR STMT {warn("STMT-FOR: missing LPAR");}
 /*for(;; ERROR*/   
| FOR LPAR SEMIC SEMIC STMT {warn("STMT-FOR: missing RPAR");}
| RETURN Expr SEMIC 
| RETURN SEMIC 
 /*return ERROR*/
| RETURN {warn("STMT-Return:missing semicolon");}
| Assign SEMIC 
/*function call*/
| ID LPAR RPAR SEMIC 
| ID LPAR Expr Expr2 RPAR SEMIC 
 /*No semic ERROR*/
| ID LPAR Expr Expr2 RPAR {warn(":missing semicolon");}  
| LCURL STMT2 RCURL 
| LCURL RCURL 
| SEMIC

Assign : ID Assign1 EQUAL Expr 
 /*Error no semi*/
| Assign {warn( "Assign: missing semicolon on line");}

Assign1 : 
| LBRAC Expr RBRAC
| LBRAC Expr error { warn("Assign1: missing RBRAC"); }
| error Expr RBRAC { warn("Assign1: missing LBRAC"); }
| LBRAC error RBRAC { warn("Assign1: Invalid array index"); }

Expr : SUB Expr %prec UMINUS
| ABANG Expr 
| Expr Binop Expr 
| Expr Relop Expr
| Expr Logop Expr 
| ID 
| ID LPAR RPAR 
| ID LPAR Expr Expr2 RPAR 
| ID LBRAC Expr RBRAC 
| LPAR Expr RPAR 
| INTCON 
| CHARCON 
| STRINGCON 
| Array 
| error {warn("Expr: invalid expression "); }

/*top is for no expression 2*/
Expr2: 
| COMMA Expr 
 /*recursively looks for another expression in function call (exp1,exp2,exp3,...*/
| COMMA Expr Expr2


Array : 
ID LBRAC Expr RBRAC 
| ID error RBRAC {warn( "Array: invalid array expression"); }

Binop : ADD 
| SUB 
| MUL 
| DIV 

Logop : ANDCOM 
| ORCOMP 

Relop : EQUALS 

| NOTEQU 

| LESEQU 

| GREEQU 

| GREATE 

| LESSTH 


%%

main()
{
int result = yyparse();
if (lex_state==1) {
yyerror("End of file within a comment");
}
if (lex_state==2) {
yyerror("End of file within a string");
}
return result;
} 
int yywrap(){
return 1;
}
yyerror(const char *s)
{
fprintf(stderr, "%s on line %d\n",s,input_line_no);
} 
warn(char *s)
{
fprintf(stderr, "%s\n", s);
}

Here is the test I am trying to run:

/* function definitions interspersed with global declarations and
   function prototypes */

void a( void ), b(int x), c(int x, int y, int z);

int a1( void ), b1(int x), c1(int x, char y, char z, int w);
int x, y[10], z;
int x0, y0, z0[20];

void foo0( void ) {}

void foo1( int x ) {}

char u0, u1[10];
char a2( void ), b2(char x), c2(char x, char y, char z, int w);

extern int a3( void ), b3(int x), c3(int x, char y, char z, int w);

extern char a4( void ), b4(char x), c4(char x, char y, char z, int w);

void foo2( int x, int y, int z ) {}

int foo3( int x[], char y, int z[], char w[] ) {}

int x1, x2[100], x3, x4, x5[1000];
int b5(int x[]), c5(int x, char y[], char z, int w[], int u[], int v);

char b6(char x[]), c6(char x, char y[], char z[], int w);

char foo4( int x[], char y, int z[], char w[] ) {}

extern int a7( void ), b7(int x[]), c7(int x[], char y, char z[], int w[]);

extern char a8( void ), b8(char x[]), c8(char x, char y[], char z, int w[]);

I've tried rewriting my grammar for comments, but I can't seem to get anything but what I have to even slightly work. Any help would be appreciated, thank you!

Upvotes: 1

Answers (2)

rici

Reputation: 241721

(F)lex automatically adds the default fallback rule

<*>.|\n        ECHO;

at the end of your ruleset, so any character not recognised by your rules will get printed on standard output. That's what you are seeing.

This behaviour is rarely what you want in a parser, and I almost always start my flex files with

%option nodefault

[Note 1]

That suppresses the default fallback rule and produces a warning if the rule would have been used by some input. Unfortunately the warning message isn't very explicit about what input might fail to be matched, but if you ignore the warning and use the generated scanner, it will produce a fatal error at run-time on unmatched input.

In this particular case, it's clear that the contents of the comment are not being matched in the COMMENT start condition. Perhaps you meant to use .|\n for the fourth rule? Although that would make the third rule redundant.

Notes:

1. Actually, I normally use:
```
%option nodefault noinput nounput noyywrap 8bit yylineno
```
noinput and nounput suppress compiler warnings for unused functions (since I don't normally use those functions); noyywrap avoids the need for yywrap, so flex sends an end of input token as soon as it sees the end of input, and yylineno tells flex to track line numbers, which is convenient for error messages.

8bit is the default as long as you're using the default table settings, but if you ask for a "fast" scanner, the default changes to producing undefined behaviour if the input includes a character code greater than 127. I found that out the hard way doing a timing test on the fast table option, so although I don't usually use that option (it doesn't speed things up much and it makes the tables a lot bigger), it seems prudent to consider the possibility that someone else might want to.

Upvotes: 0

user1456982

Reputation: 133

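Your patterns for block comments are incorrect in several ways. For one, ^[\t]*"/*" only recognises a comment that starts at the beginning of a line (after optional tabs), and <COMMENT>.\n only matches a character that is immediately followed by a newline, so most of the comment body matches none of your COMMENT rules.

Typically, the lexical patterns for a block comment look like this: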

"/*"  { BEGIN COMMENT; }

<COMMENT>[^*/]+ { /* ignore anything that is not '*' or '/' */ }
<COMMENT>("*"+)"/" { BEGIN INITIAL; }
<COMMENT>[*/] { /* residual stuff */ }
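
To check the logic without running flex, the same two-state idea can be modelled in plain C. This is only a sketch: the names scan_state, ST_INITIAL, ST_COMMENT and strip_block_comments are made up for illustration and are not part of flex or of the question's code. It copies its input while dropping block comments, counts newlines (including those inside comments, which the rules above do not do), and reports whether the input ended inside a comment, which is what the lex_state check in the question's main() is trying to detect.

```c
#include <stddef.h>

/* Hypothetical names, for illustration only. The two values mirror the
 * INITIAL and COMMENT start conditions of the scanner. */
enum scan_state { ST_INITIAL, ST_COMMENT };

/*
 * Copy `in` to `out`, dropping every block comment.
 * `out` must have room for strlen(in) + 1 bytes.
 * If `lines` is non-NULL, *lines is incremented for every newline seen,
 * inside or outside a comment, so line numbers stay correct.
 * Returns the state at end of input: ST_COMMENT means a comment was
 * opened but never closed.
 */
enum scan_state strip_block_comments(const char *in, char *out, int *lines)
{
    enum scan_state state = ST_INITIAL;
    size_t i = 0, o = 0;

    while (in[i] != '\0') {
        if (in[i] == '\n' && lines != NULL)
            ++*lines;

        if (state == ST_INITIAL) {
            if (in[i] == '/' && in[i + 1] == '*') {
                /* Comment opener: switch state, emit nothing. */
                state = ST_COMMENT;
                i += 2;
            } else {
                /* Ordinary input: pass it through unchanged. */
                out[o++] = in[i++];
            }
        } else {
            if (in[i] == '*' && in[i + 1] == '/') {
                /* Comment closer: return to normal scanning. */
                state = ST_INITIAL;
                i += 2;
            } else {
                /* Any other character, newline included, is consumed
                 * silently. There is no fallback that echoes it. */
                i++;
            }
        }
    }
    out[o] = '\0';
    return state;
}
```

For example, given the input "int x; /* a\n ** b */int y;\n", the output buffer ends up holding "int x; int y;\n" and the function returns ST_INITIAL, while an input such as "a /* b" returns ST_COMMENT. The point to carry back to the flex rules is the final else branch: every character inside the comment must be matched by some rule, otherwise the default rule echoes it.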

Upvotes: 0
