Reputation: 159
I want to implement a syntax checker for JSON files with Bison. I wrote

Array: LB RB
| LB Values RB
| LB Values COMMA RB error { puts("extra comma, recovered"); }
;

to recognize a syntax error like ["extra comma",], but it failed. The similar pattern for Object works, so I think it may be related to the order of the rules. With

Values: Value
| Value COMMA Values
| Value COMMA error { puts("extra comma, recovered"); }
;

it succeeded. I wonder why

Value COMMA error { puts("extra comma, recovered"); }

doesn't wrongly match input that should be parsed as Value COMMA Values.

syntax.y
%{
#include"lex.yy.c"
void yyerror(const char*);
%}
%token LC RC LB RB COLON COMMA
%token STRING NUMBER
%token TRUE FALSE VNULL
%%
Json:
Value
;
Value:
Object
| Array
| STRING
| NUMBER
| TRUE
| FALSE
| VNULL
;
Object:
LC RC
| LC Members RC
| LC Member COMMA RC error { puts("extra comma, recovered"); }
;
Members:
Member
| Member COMMA Members
;
Member:
STRING COLON Value
;
Array:
LB RB
| LB Values RB
| LB Values RC error { puts("unmatched right bracket, recovered"); }
;
Values:
Value
| Value COMMA Values
| Value COMMA error { puts("extra comma, recovered"); }
;
%%
void yyerror(const char *s){
    printf("syntax error: %s\n", s);
}
int main(int argc, char **argv){
    if(argc != 2) {
        fprintf(stderr, "Usage: %s <file_path>\n", argv[0]);
        exit(-1);
    }
    else if(!(yyin = fopen(argv[1], "r"))) {
        perror(argv[1]);
        exit(-1);
    }
    yyparse();
    return 0;
}
lex.l
%{
#include"syntax.tab.h"
%}
%option noyywrap
unic u[0-9a-fA-F]{4}
esc \\([\"\\/bfnrt]|{unic})
scp [^"\\\x00-\x1f]
string \"({esc}|{scp})*\"
int 0|[1-9][0-9]*
frac \.[0-9]+
exp [Ee][+-]?[0-9]+
number -?{int}{frac}?{exp}?
empty [ \n\r\t]
%%
"{" { return LC; }
"}" { return RC; }
"[" { return LB; }
"]" { return RB; }
":" { return COLON; }
"," { return COMMA; }
"true" { return TRUE; }
"false" { return FALSE; }
"null" { return VNULL; }
{string} { return STRING; }
{number} { return NUMBER; }
{empty} {}
. { printf("lexical error: %s\n", yytext); }
Upvotes: 0
Views: 139
Reputation: 241861
The error token is produced automatically by the bison parser when it is unable to do anything with the lookahead token. That enables it to make progress by shifting the error token, if the current parser state has such a shift action.
A production which contains the error token cannot be used unless an error token has been produced.
After shifting the error token, the parser will attempt to match the rest of the production containing it. If necessary, it will skip input until it finds a token which can be shifted. That can be used to synchronise the parse. So it's pretty common to see patterns like
Array: '[' Values ']'
| '[' Values error ']'
which attempts to match the closing bracket.
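In the token names used in the question's syntax.y, that would look roughly like this (a sketch, untested against the rest of the grammar; after shifting the error token the parser skips input until the closing RB can be matched):

Array: LB RB
| LB Values RB
| LB Values error RB { puts("error inside array, resynchronised at ']'"); }
;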
Alternatively, you might just restart the parse, in a manner similar to your second example. (Here I use left recursion, which is generally a better choice in case the list is long, since left recursion doesn't accumulate the entire list on the parser stack):
Values: Value
| Values ',' Value
| Values error
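Adapted to the question's token names, that might read (a sketch; on an error the parser pops back to the last complete Values, reports, and continues):

Values: Value
| Values COMMA Value
| Values error { puts("bad element in list, recovered"); }
;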
If you want to produce a specific error for the case where the input includes a trailing comma, your best approach is to just include a normal rule:
Array: '[' Values ',' ']' { /* Signal the error */ }
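Combined with the left-recursive Values above and the question's token names, that could look something like this (a sketch; the trailing comma is matched by an ordinary rule, so no error token is involved):

Array: LB RB
| LB Values RB
| LB Values COMMA RB { puts("extra comma before ']'"); }
;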
Good error recognition is an art, not a science. A certain amount of experimentation is almost always necessary.
In most applications, it would be best to simply reject invalid JSON immediately rather than attempting to produce a meaningful error message, much less to continue the parse. JSON is normally machine-generated and should be correct; if it is not, there is no back channel for sending a meaningful error, and no one to send it to. Incorrectly formatted interchange strings are often attempts to provoke vulnerabilities, and should be treated as such. (Your application may be an exception, of course. But experience shows that immediate rejection is almost always the best strategy, as well as being the simplest.)
Upvotes: 1