Reputation: 125

How can I find out the unexpected token in a C++ bison parser?

I am using bison/flex to develop a parser in c++ for an expression that a user can type into a field in a gui. I would like to be able to give feedback to the user about allowed tokens (basically autocomplete) as they are typing. The information that '%error-verbose' generates would be sufficient, but it is only available as a string. Is there a way I can get programmatic access to the unexpected token and the expected token list while processing a parse error?

Upvotes: 1

Answers (2)

Dan

Reputation: 125

I ended up customising the skeleton that bison uses to give me access to the information I wanted. This was a hack because as @rici says in his answer bison doesn't give public access to the information I was interested in. I modified the error function to take yytoken and yystate as extra parameters, the same variables that are passed into yysyntax_error_. I then used the same algorithm that yysyntax_error_ uses to generate to generate its 'verbose' message to produce a list of expected tokens and pass them back to the driving program. It's messy, but for my simple grammar at the moment it achieves what I wanted.

Upvotes: 1

rici

Reputation: 241921

The token itself is in the variable yychar. That part is easy.

Finding the list of possibilities is trickier.

Conceptually, you could reparse the current input up to but not including the erroneous token; save the parser state; and then try every other possible token in turn to see if an error is produced.

The reason you need to reparse is that LALR parsers may execute erroneous reductions before encountering a syntax error. (They never execute erroneous shifts, though.) In order to discover the valid lookaheads for the parser state, these reductions would have to be undone, and there is no mechanism for doing that. In general, a reduction loses information so it is not even theoretically possible.

If you enable LAC (q.v.), which you need to do in order to get precise errors, error-verbose parsers avoid the reduction issue by doing an exploratory parse (without reduction actions) on every token which might trigger an incorrect reduction. If this parse fails, then the parser state is available for constructing the list of options; if it succeeds, then it is redone with reduction actions.

Unfortunately, bison does not provide an API for "copy the parser state"; you could reverse-engineer one easily enough, but that would be pretty fragile. So if you wanted to try this without access to the generated parser's internals, you would actually have to reparse the input many times, once for each possible lookahead token.

You could use a canonical LR parser, which has the property that errors are detected before any reduction. Full LR parsing tables can be enormous, but if your grammar is simple enough this might not be a problem. However, you still have no clean way to save parser state, so unless you reverse-engineer that, you would still have to reparse for every successful lookahead token. (Or enough of them to construct a valid error message. Bison's verbose error setting will only output a maximum of five possibilities, and for good reason.)

Possibly the simplest solution is to parse the bison error message, which has a simple fixed format. If you were to do this, I would recommend making your token names simple easily-parsed words and substituting human-readable text in your yyerror handler.

Enabling LAC definitely slows down a parse. In general, all precise error-detection and reporting modifications slow down parsers, sometimes even noticeably; this includes keeping position information (although that is also useful for debug output, so in practice it may be necessary anyway).

The recommendation I always give, because it has worked well for me in practice, is to build two parsers: one which is optimized for error-free code and makes no attempt to do anything more than reject input on the first error, and another (possibly much slower) one which can handle error detection and recovery. Erroneous inputs are then parsed twice, once with the quick parser and then again with the slow one; correct inputs only need to be parsed once and only with the quick parser. This makes project builds rapid, and normally doesn't slow down the initial write-"compile"-edit loop much, as long as the fast parser is actually fast. Keeping the two parsers in synch can be annoying, but most of the time the error-recovery parser only requires some additional methods which can be turned into no-ops and then optimized away in the fast parser. With this strategy, you might be able to use the fast parser to do the "legal lookahead" generation, and it might turn out to be fast enough.

As always, YMMV. Good luck.

Upvotes: 4

How can I find out the unexpected token in a C++ bison parser?

Answers (2)

Related Questions