Xiaowen Zhang
Xiaowen Zhang

Reputation: 159

What is the order that bison matching the pattern?

Description

I want to implement a syntax detector for json file with bison.

  1. I used
    Array: LB RB
         | LB Values RB
         | LB Values COMMA RB error
              { puts("extra comma, recovered"); };
    
    to recognize syntax error like ["extra comma",], but it failed. But the similar pattern for Object works. I think it may relate to the pattern order.
  2. However, when I use
    Values: Value
          | Value COMMA Values
          | Value COMMA error
               { puts("extra comma, recovered"); }
    
    it succeeded. I wonder why
    Value COMMA error { puts("extra comma, recovered"); }
    
    won't mismatch the text that should be Value COMMA Values?

Source code

syntax.y

%{
    #include"lex.yy.c"
    void yyerror(const char*);
%}

%token LC RC LB RB COLON COMMA
%token STRING NUMBER
%token TRUE FALSE VNULL
%%

Json:
      Value
    ;
Value:
      Object
    | Array
    | STRING
    | NUMBER
    | TRUE
    | FALSE
    | VNULL
    ;
Object:
      LC RC
    | LC Members RC
    | LC Member COMMA RC error { puts("extra comma, recovered"); }
    ;
Members:
      Member
    | Member COMMA Members
    ;
Member:
      STRING COLON Value
    ;
Array:
      LB RB
    | LB Values RB
    | LB Values RC error { puts("unmatched right bracket, recovered"); }
    ;
Values:
      Value
    | Value COMMA Values
    | Value COMMA error { puts("extra comma, recovered"); }
    ;
%%

void yyerror(const char *s){
    printf("syntax error: ");
}

int main(int argc, char **argv){
    if(argc != 2) {
        fprintf(stderr, "Usage: %s <file_path>\n", argv[0]);
        exit(-1);
    }
    else if(!(yyin = fopen(argv[1], "r"))) {
        perror(argv[1]);
        exit(-1);
    }
    yyparse();
    return 0;
}

lex.l

%{
    #include"syntax.tab.h"
%}
%option noyywrap

unic u[0-9a-fA-F]{4}
esc \\([\"\\/bfnrt]|{unic})
scp [^"\\\x00-\x1f]
string \"({esc}|{scp})*\"

int 0|[1-9][0-9]*
frac \.[0-9]+
exp [Ee][+-]?[0-9]+
number -?{int}{frac}?{exp}?

empty [ \n\r\t]

%%

"{" { return LC; }
"}" { return RC; }
"[" { return LB; }
"]" { return RB; }
":" { return COLON; }
"," { return COMMA; }

"true" { return TRUE; }
"false" { return FALSE; }
"null" { return VNULL; }

{string} { return STRING; }
{number} { return NUMBER; }

{empty} {}

. { printf("lexical error: %s\n", yytext); }

Upvotes: 0

Views: 139

Answers (1)

rici
rici

Reputation: 241861

The error token is produced automatically by the bison parser when it is unable to do anything with the lookahead token. That enables it to make progress by shifting the error token, if the current parser state has such a shift action.

A production which contains the error token cannot be used unless an error token has been produced.

After shifting the error token, the parser will attempt to match the rest of the production containing it. If necessary, it will skip input until it finds a token which can be shifted. That can be used to synchronise the parse. So it's pretty common to see patterns like

Array: '[' Values ']'
     | '[' Values error ']'

which attempts to match the closing bracket.

Alternatively, you might just restart the parse, in a manner similar to your second example. (Here I use left-recursion, which is generally a better choice in case the list is long, since left recursion doesn't require usage of the parse stack):

Values: Value
      | Values ',' Value
      | Values error

If you want to produce a specific error for the case where the input includes a trailing comma, your best approach is to just include a normal rule:

Array: '(' Values ',' ')' { /* Signal the error */ }

Good error recognition is an art, not a science. A certain amount of experimentation is almost always necessary.


In most applications, it would be best to simply reject invalid JSON immediately rather than attempting to produce a meaningful error message and much less to continue the parse. JSON is normally machine-generated and should be correct; if it is not, there is no back channel for sending a meaningful error, and no-one to send it to. Incorrectly formatted interchange strings are often attempts to provoke vulnerabilities, and should be treated as such. (Your application may be an exception, of course. But experience shows that immediate rejection is almost always the best strategy, as well as being the simplest.)

Upvotes: 1

Related Questions