J-W
J-W

Reputation: 408

c++ parser and formatter using a single grammar declaration

I have this idea to be able to 'declare' a grammar and use the same declaration for generating the format function.

A parser generator (e.g. antlr) is able to generate the parser from a bnf grammar. But is there a way to use the same grammar to generate the formatting code?

I just want to avoid manually having to sync the parsing code (generated) with a manually written formatting code, since the grammar is the same.

could I use the abstract syntax tree? boost::spirit? metaprogramming? anyone tried this?

thanks

Upvotes: 0

Views: 625

Answers (2)

Ira Baxter
Ira Baxter

Reputation: 95334

OP asks, Has anyone tried this?

Yes. Our DMS Software Reengineering Toolkit can do this: you give it just a grammar, you get a parser that builds ASTs and you get a prettyprinter. We've used this on lots of parse/changeAST/unparse tasks for many language over the last 20 years, preserving the meaning of the source program exactly.

The process is to parse according to the grammar, build an AST, and then walk the AST to carry out prettyprinting operations.

However, you don't get a good prettyprinter. Nice layout of the reformatted source code requires that language cues for block nesting (e.g., matching '{' ... '}', 'BEGIN' ... 'END' pairs, special keywords 'if', 'for', etc.) be used to drive the formatting and indentation. While one can guess what these elements are (as I just did), that's just a guess and in practice a human being needs to inspect the grammar to determine which things are cues and how to format each construct. (The default prettyprinter derived from the grammar makes such guesses).

DMS provides support for that problem in the form of prettyprinter declarations woven into the grammar to provide the formatter engineer quite a lot of control over the layout. (See this SO answer for detailed discussion: https://stackoverflow.com/a/5834775/120163) This produces (our opinion) pretty good prettyprinters. And DMS does have an explicit grammar/formatter for full C++14. [EDIT Aug 2018: full C++17 in MS and GCC dialects)]

EDIT: rici's answer suggests that comments are difficult to handle. He's right, in the sense that you must handle them, and yes, it is hard to handle them if they are removed as whitespace while parsing. The essense of the problem is "removed as whitespace"; and goes away if you don't do that. DMS provides means to capture the comments (rather than ignoring them as whitespace) and attach them (automatically) to AST nodes. The decision as to which AST node captures the comments is handled in the lexer by declaring comments as "pre" (happening before a token) or "post"; this decision is heuristic on the part of the grammer/lexer engineer, but works actually pretty well. The token with comments is passed to the parser, which builds an AST node from it. With comments attached to AST nodes, the prettyprinter can re-generate them, too.

Upvotes: -2

rici
rici

Reputation: 241701

It's not clear to me whether this question is looking for an existing product or library (in which case, the question would be out-of-scope for Stack Overflow), or in algorithms for automatically generating a pretty printer from (some formalism for) a grammar. Here, I've tried to provide some pointers for the second possibility.

There is a long history of research into syntax-directed pretty printing, and a Google or Citeseer search on that phrase will probably give you lots of reading material. I'd recommend trying to find a copy of Derek Oppen's 1979 paper, Prettyprinting, which describes a linear-time algorithm based on the insertion of a few pretty-printing operators into the tokenized source code.

Oppen's basic operators are fairly simple: they consist of indications about how code segments are to be (recursively) grouped, about where newlines must and might be inserted, and about where in a group to increase indentation depth. With the set of proposed operators, it is possible to create an on-line algorithm which prefers to break lines higher up in the parse tree, avoiding the tendency to over-indent deeply-nested code, which is a classic failing of naïve indentation algorithms.

In essence, the algorithm uses a two-finger solution, where the leading finger consumes new tokens and notices when the line must be wrapped, at which point it signals the trailing finger. The trailing finger then finds the earliest point at which a newline could be inserted and all the additional newlines and indents which must be inserted to conform with the operators, advancing until there is no newline between the fingers.

The on-line algorithm might not produced optimal indentation/reflowing (and it is not immediately obvious what the definition of "optimal" might be); for certain aspects of the pretty-printing, it might be useful to think about the ideas in Donald Knuth's optimal line-wrapping algorithm, as described in his 1999 text, Digital Typography. (More references in the Wikipedia article on line wrapping.)

Oppen's algorithm is not perfect (as indicated in the paper) but it may be "good enough" for many practical purposes. (I note some limitations below.) Tracing the citation history of this paper will give you a number of implementations, improvements, and alternate algorithms.

It's clear that a parser generator could easily be modified to simply insert pretty-printing annotations into a token stream, and I believe that there have been various attempts to create yacc-like pretty-printer generators. (And possibly ANTLR derivatives, too.) The basic idea is to embed the pretty printing annotations in the grammar description, which allows the automatic generation of a reduction action which outputs an annotated token stream.

Syntax-directed pretty printing was added to the ASF+SDF Meta-Environment using a similar annotation system; the basic algorithm and formalism is described by M.G.J. van der Brand in Pretty Printing in the ASF+SDF Meta-environment Past, Present and Future (1995), which also makes for interesting reading. (ASF+SDF has since been superseded by the Rascal Metaprogramming Language, which includes visualization tools.)

One important issue with syntax-directed pretty printing algorithms is that they are based on the parse of a tokenized stream, which means that comments have already been removed. Clearly it is desirable that comments be retained in a pretty-printed version of a program, but correctly attaching comments to the relevant code is not trivial, particularly when the comment is on the same line as some code. Consider, for example, the case of a commented-out operation embedded into code:

// This is the simplified form of actual code
int needed_ = (current_ /* + adjustment_ */ ) * 2;

Or the common case of trailing comments used to document variables:

   /* Tracking the current allocation */
   int   needed_;      // Bytes required.
   int   current_;     // Bytes currently allocated.
// int adjustment_;    // (TODO) Why is this needed?
   /* Either points to the current allocation, or is 0 */
   char* buffer_;

In the above example, note the importance of whitespace: the comments may apply to the previous declaration (even though they appear after the semicolon which terminates it) or to the following declaration(s), mostly depending on whether they are suffix comments or full-line comments, but the commented-out code is an exception. Also, the programmer has attempted to line up the names of the member variables.

Another problem with automated syntax-directed pretty-printing is handling incorrect (or incomplete) programs, as would need to be done if the pretty-printing is part of a Development Environment. Error-handling (and error recovery) is by far the most difficult part of automatically-generated parsers; maintaining useful pretty printing in this context is even more complicated. It's precisely for this reason that most IDEs use a form of peephole pretty-printing (another possible search phrase), or even adaptive pretty-printing where user indentation is used as a guide to the location of as-yet-unwritten code.

Upvotes: 3

Related Questions