Reputation: 63
Currently I'm working on a source-to-source compiler and I have already written a bison parser that creates the AST for the input correctly. I need to do several transformations on the syntax tree now and therefore I need to insert many nodes to the tree.
I could create all the structs/unions that I want to add to the syntax tree manually, but this seems to be very much work.
It would be much easier for me to create a string and I want this string to be parsed by the parser I already have. The parser should then return the tree for this string, which I can insert in my original syntax tree.
Unfortunately, the string cannot be parsed with the start rule of my parser, since it must be parsed by a subrule (e.g. my parser parses a list of functions containing statements, the string is a single statement).
How can I make bison parse a string, starting with a rule different from the start rule?
Thanks in advance!
Upvotes: 4
Views: 2714
Reputation: 58598
Fake tokens do not quite solve the problem. When you have some sub-rule of a language that you would like to parse, you probably also have the requirement of invoking that rule repeatedly (with multiple calls to yyparse). And that requires you to "prime" the parser with the fake tokens which cause it to return to the interesting rule on each call, while maintaining the correct state of the actual non-fake token stream. Plus, you need a way to detect that the yyparse call has encountered EOF.
Ideally, you also want to be able to mix yyparse calls with other operations on the stream, which means having precise control over the lookahead that is performed by your Flex + Yacc combo.
I solved all these issues in the parser for the TXR language. In this language, there are some interesting sub-languages: a Lisp syntax, as well as a regex syntax.
The problem is to provide a Lisp-reading function which extracts a single object from a stream and leaves the stream in a reasonable state (with regard to lookahead). For instance, suppose the stream contains this:
(a b c d) (e f g h)
We prime the parser with the fake token needed to reach the Lisp sub-grammar, then call yyparse. When yyparse is done, it will have consumed everything up to here:
(a b c d) (e f g h)
           ^ stream pointer is here
          ^ the lookahead token is the parenthesis
After this call, if someone calls a function to get a character from the stream, they will get the e, and not the ( parenthesis, unfortunately.
Anyway, so we called yyparse, got the (a b c d) Lisp object, the stream pointer is at the e, and the lookahead token is (.
The next time we call yyparse, it will ignore this lookahead token and we will get a misparse. We must not only prime the parser with the fake pseudo-tokens that cause it to parse a Lisp expression; we must also get it to start parsing at the ( lookahead token. The way to do this is to insert this token into the priming stream.
In the TXR parser, I implement a token stream object which can take up to four tokens of pushback. When yylex is called, tokens are pulled from this pushback, and only when it is empty is real lexing performed. This is used in the prime_parser function:
void prime_parser(parser_t *p, val name, enum prime_parser prim)
{
  struct yy_token sec_tok = { 0 };

  switch (prim) {
  case prime_lisp:
    sec_tok.yy_char = SECRET_ESCAPE_E;
    break;
  case prime_interactive:
    sec_tok.yy_char = SECRET_ESCAPE_I;
    break;
  case prime_regex:
    sec_tok.yy_char = SECRET_ESCAPE_R;
    break;
  }

  if (p->recent_tok.yy_char)
    pushback_token(p, &p->recent_tok);

  pushback_token(p, &sec_tok);
  prime_scanner(p->scanner, prim);
  set(mkloc(p->name, p->parser), name);
}
The parser's recent_tok member keeps track of the most recently seen token, which gives us access to the lookahead token from the most recent parse.
To get control over yylex, I implemented this hack in parser.l:
/* Near top of file */
#define YY_DECL \
  static int yylex_impl(YYSTYPE *yylval_param, yyscan_t yyscanner)

/* Later */
int yylex(YYSTYPE *yylval_param, yyscan_t yyscanner)
{
  struct yyguts_t *yyg = convert(struct yyguts_t *, yyscanner);
  int yy_char;

  if (yyextra->tok_idx > 0) {
    struct yy_token *tok = &yyextra->tok_pushback[--yyextra->tok_idx];
    yyextra->recent_tok = *tok;
    *yylval_param = tok->yy_lval;
    return tok->yy_char;
  }

  yy_char = yyextra->recent_tok.yy_char = yylex_impl(yylval_param, yyscanner);
  yyextra->recent_tok.yy_lval = *yylval_param;
  return yy_char;
}
If the token pushback index is nonzero, we pop the canned token and return it to Yacc. Otherwise we call yylex_impl, the real lexer. And note how, when we do that, we also peek at what the lexer returned and stash it in recent_tok.yy_char and recent_tok.yy_lval.
(What if yy_lval is a heap-allocated object type? Good thing we have garbage collection in this project!)
In the rules which match these sub-languages, I had to use YYACCEPT. And note the byacc_fool business: that was needed to get the hacks to work with Berkeley Yacc (T. E. Dickey's maintained version of it, which supports the Bison reentrant parser scheme).
spec : clauses_opt { parser->syntax_tree = $1; }
     | SECRET_ESCAPE_R regexpr { parser->syntax_tree = $2;
                                 end_of_regex(scnr); }
     | SECRET_ESCAPE_E n_expr { parser->syntax_tree = $2;
                                YYACCEPT; }
       byacc_fool { internal_error("notreached"); }
     | SECRET_ESCAPE_I i_expr { parser->syntax_tree = $2;
                                YYACCEPT; }
       byacc_fool { internal_error("notreached"); }
     | SECRET_ESCAPE_E { if (yychar == YYEOF) {
                           parser->syntax_tree = nao;
                           YYACCEPT;
                         } else {
                           yybadtok(yychar, nil);
                           parser->syntax_tree = nil;
                         } }
     | SECRET_ESCAPE_I { if (yychar == YYEOF) {
                           parser->syntax_tree = nao;
                           YYACCEPT;
                         } else {
                           yybadtok(yychar, nil);
                           parser->syntax_tree = nil;
                         } }
     | error '\n' { parser->syntax_tree = nil;
                    if (parser->errors >= 8)
                      YYABORT;
                    yyerrok;
                    yybadtok(yychar, nil); }
     ;
Why the YYACCEPT? I don't remember; good thing we have detailed ChangeLog messages:
* parser.y (spec): Use YYACCEPT in the SECRET_ESCAPE_E clause for
pulling a single expression out of the token stream. YYACCEPT
is a trick for not invoking the $accept : spec . $end production
which is implicitly built into the grammar, and which causes
a token of lookahead to occur. This allows us to read a full
expression without stealing any further token: but only if the
grammar is structured right.
I think this comment is slightly misleading by omission. The implicit $end production causes a problem that is more than unwanted lookahead: it is looking ahead because it actually wants to match EOF. I seem to remember that the YYACCEPT is a way of bailing out of the parser so that it doesn't throw a syntax error when the next token isn't the $end token, which is a built-in representation of EOF.
Yacc ends up looking ahead anyway; what we don't want it to do is raise a syntax error because the lookahead isn't the end-of-file expected by the rule. When we have
(a b c d) (e f g h)
and a simple grammar rule that matches an expression, the (e f g h) stuff looks like stray material, and that is a syntax error! After the parser obtains the first ) token, it calls yylex again and gets (, which is a syntax error; it wants yylex to indicate EOF at that point. YYACCEPT is a workaround for this. We let Yacc call yylex and pull out the second (, and make note of it so we can push it back in the next yyparse call; but we prevent Yacc from having a fit over this token.
Upvotes: 0
Reputation: 8759
Given what you seem to be willing to do, you might find interesting ideas to recycle from this paper.
It provides means to support concrete syntax from C++ code, things like (here, desugaring from the Bison parser itself):
exp: exp "&" exp
  {
    // Sub-ASTs are used as-is (they are not printed, then parsed,
    // as in the previous example). Spaces are no longer critical.
    $$ = parse(Tweast() <<
               "if" << $1 << "then" << $3 << "<> 0 else 0");
  };
Upvotes: 2
Reputation: 241771
There is a simple hack, which is described in the bison FAQ.
Basically, for each non-terminal which you would like to be able to use as a start rule, you create one pseudo-token, and then you create a "meta-start" non-terminal which selects the non-terminal you want to use:
%token START_PROGRAM
%token START_STATEMENT
%token START_EXPRESSION

%start meta_start

%%

meta_start: START_PROGRAM program
          | START_STATEMENT statement
          | START_EXPRESSION expression
          ;
(In the action for each of those productions, you would save the value of $2
somewhere that your caller can get at it.)
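Concretely, those actions might look like this; parse_result is a hypothetical global here (with a pure parser, it would live in the shared parser-state structure instead):

```yacc
meta_start: START_PROGRAM program       { parse_result = $2; }
          | START_STATEMENT statement   { parse_result = $2; }
          | START_EXPRESSION expression { parse_result = $2; }
          ;
```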
Now you just need to arrange for your lexer to deliver the correct start token. You could do this by using a pure parser and a pure lexer and passing the message through a shared data structure. That would be the best way to do it, but for the purposes of this answer I'll just show how to do it with a global variable, since the principle is the same:
extern int start_token;

// If all your start non-terminals produce a value of the same type,
// possibly a union type, you could arrange to return it instead of
// the error report.
int yyparse(int start) {
  // I left out the code to cause the lexer to scan a given string.
  start_token = start;
  return real_yyparse();
}

int yylex() {
  int rv = start_token;
  if (start_token)
    start_token = 0;
  else
    rv = real_yylex();
  return rv;
}
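The lexer-side trick can be demonstrated without generating a real parser. The following is a self-contained sketch in which the token numbers, the stand-in real_yylex, and the two-token input are all made up for illustration; only the yylex wrapper logic is taken from the answer above:

```c
#include <assert.h>

/* Made-up token numbers standing in for bison's %token values. */
enum { END = 0, START_STATEMENT = 1001, IDENT = 258 };

static int start_token;

/* Stand-in for the real flex-generated lexer: returns two IDENT
   tokens, then end-of-input. */
static int real_yylex(void)
{
    static int remaining = 2;
    return remaining-- > 0 ? IDENT : END;
}

/* The wrapper from the answer: deliver the start pseudo-token exactly
   once, then defer to the real lexer on every later call. */
static int yylex(void)
{
    int rv = start_token;
    if (start_token)
        start_token = 0;
    else
        rv = real_yylex();
    return rv;
}
```

Setting start_token = START_STATEMENT before calling the parser makes the very first yylex call return the pseudo-token, steering the grammar into the statement production; every subsequent call comes from the real token stream.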
Upvotes: 9