c++ parser and formatter using a single grammar declaration

Question

I have this idea to be able to 'declare' a grammar and use the same declaration for generating the format function.

A parser generator (e.g. antlr) is able to generate the parser from a bnf grammar. But is there a way to use the same grammar to generate the formatting code?

I just want to avoid manually having to sync the parsing code (generated) with a manually written formatting code, since the grammar is the same.

could I use the abstract syntax tree? boost::spirit? metaprogramming? anyone tried this?

thanks

rici · Accepted Answer

It's not clear to me whether this question is looking for an existing product or library (in which case, the question would be out-of-scope for Stack Overflow), or in algorithms for automatically generating a pretty printer from (some formalism for) a grammar. Here, I've tried to provide some pointers for the second possibility.

There is a long history of research into syntax-directed pretty printing, and a Google or Citeseer search on that phrase will probably give you lots of reading material. I'd recommend trying to find a copy of Derek Oppen's 1979 paper, Prettyprinting, which describes a linear-time algorithm based on the insertion of a few pretty-printing operators into the tokenized source code.

Oppen's basic operators are fairly simple: they consist of indications about how code segments are to be (recursively) grouped, about where newlines must and might be inserted, and about where in a group to increase indentation depth. With the set of proposed operators, it is possible to create an on-line algorithm which prefers to break lines higher up in the parse tree, avoiding the tendency to over-indent deeply-nested code, which is a classic failing of naïve indentation algorithms.

In essence, the algorithm uses a two-finger solution, where the leading finger consumes new tokens and notices when the line must be wrapped, at which point it signals the trailing finger. The trailing finger then finds the earliest point at which a newline could be inserted and all the additional newlines and indents which must be inserted to conform with the operators, advancing until there is no newline between the fingers.

The on-line algorithm might not produced optimal indentation/reflowing (and it is not immediately obvious what the definition of "optimal" might be); for certain aspects of the pretty-printing, it might be useful to think about the ideas in Donald Knuth's optimal line-wrapping algorithm, as described in his 1999 text, Digital Typography. (More references in the Wikipedia article on line wrapping.)

Oppen's algorithm is not perfect (as indicated in the paper) but it may be "good enough" for many practical purposes. (I note some limitations below.) Tracing the citation history of this paper will give you a number of implementations, improvements, and alternate algorithms.

It's clear that a parser generator could easily be modified to simply insert pretty-printing annotations into a token stream, and I believe that there have been various attempts to create yacc-like pretty-printer generators. (And possibly ANTLR derivatives, too.) The basic idea is to embed the pretty printing annotations in the grammar description, which allows the automatic generation of a reduction action which outputs an annotated token stream.

Syntax-directed pretty printing was added to the ASF+SDF Meta-Environment using a similar annotation system; the basic algorithm and formalism is described by M.G.J. van der Brand in Pretty Printing in the ASF+SDF Meta-environment Past, Present and Future (1995), which also makes for interesting reading. (ASF+SDF has since been superseded by the Rascal Metaprogramming Language, which includes visualization tools.)

One important issue with syntax-directed pretty printing algorithms is that they are based on the parse of a tokenized stream, which means that comments have already been removed. Clearly it is desirable that comments be retained in a pretty-printed version of a program, but correctly attaching comments to the relevant code is not trivial, particularly when the comment is on the same line as some code. Consider, for example, the case of a commented-out operation embedded into code:

// This is the simplified form of actual code
int needed_ = (current_ /* + adjustment_ */ ) * 2;

Or the common case of trailing comments used to document variables:

   /* Tracking the current allocation */
   int   needed_;      // Bytes required.
   int   current_;     // Bytes currently allocated.
// int adjustment_;    // (TODO) Why is this needed?
   /* Either points to the current allocation, or is 0 */
   char* buffer_;

In the above example, note the importance of whitespace: the comments may apply to the previous declaration (even though they appear after the semicolon which terminates it) or to the following declaration(s), mostly depending on whether they are suffix comments or full-line comments, but the commented-out code is an exception. Also, the programmer has attempted to line up the names of the member variables.

Another problem with automated syntax-directed pretty-printing is handling incorrect (or incomplete) programs, as would need to be done if the pretty-printing is part of a Development Environment. Error-handling (and error recovery) is by far the most difficult part of automatically-generated parsers; maintaining useful pretty printing in this context is even more complicated. It's precisely for this reason that most IDEs use a form of peephole pretty-printing (another possible search phrase), or even adaptive pretty-printing where user indentation is used as a guide to the location of as-yet-unwritten code.

c++ parser and formatter using a single grammar declaration

Answers (2)

Related Questions