lukasz
lukasz

Reputation: 3

How detailed grammar should be - good practices

I have a question related to good practice in defining grammar. I wonder if there are any generally accepted standards as to how detailed a grammar should be. Below are some simplified examples in ANTRL4 and EBNF.

Solution #1 (ANTLR4)

grammar test;

test
    : statement*
    ;

statement
    : s1
    | s2
    ;

s1
    : '(' a_label ')' '-' '[' b_label ']'
    ;

s2
    : '<' a_label '>'
    ;
    
a_label
    : STRING
    ;

b_label
   : STRING
   ;

STRING
    : '"' (~ ["\\\r\n] | '\'' | '\\"')* '"'
    ;

Solution #1 (EBNF)

test     ::= statement*
statement
         ::= s1
           | s2
s1       ::= '(' a_label ')' '-' '[' b_label ']'
s2       ::= '<' a_label '>'
a_label  ::= STRING
b_label  ::= STRING

STRING   ::= '"' ( [^"\#xd#xa] | "'" | '\"' )* '"'

Now let's modify the first solution a bit.

Solution #2 (ANTLR4)

grammar test;

test
    : statement*
    ;

statement
    : s1
    | s2
    ;

s1
    : '(' STRING ')' '-' '[' STRING ']'
    ;

s2
    : '<' STRING '>'
    ;
  

STRING
    : '"' (~ ["\\\r\n] | '\'' | '\\"')* '"'
    ;

Solution #2 (EBNF)

test     ::= statement*
statement
         ::= s1
           | s2
s1       ::= '(' STRING ')' '-' '[' STRING ']'
s2       ::= '<' STRING '>'

STRING   ::= '"' ( [^"\#xd#xa] | "'" | '\"' )* '"'

Changes compared to the solution #1: Removed a_label and b_label and inserted STRING directly.

Solution #3 (ANTLR4)

grammar test;

test
    : statement*
    ;

statement
    : s1
    | s2
    ;

s1
    : '(' label ')' '-' '[' label ']'
    ;

s2
    : '<' label '>'
    ;
    
label
    : STRING
    ;

STRING
    : '"' (~ ["\\\r\n] | '\'' | '\\"')* '"'
    ;

Solution #3 (EBNF)

test     ::= statement*
statement
         ::= s1
           | s2
s1       ::= '(' label ')' '-' '[' label ']'
s2       ::= '<' label '>'
label    ::= STRING

STRING   ::= '"' ( [^"\#xd#xa] | "'" | '\"' )* '"'

Changes compared to the solution #1: Changed a_label and b_label to label.

Which is the most correct solution? Are there any good practices related to this?

UPDATE: a_label and b_label are two different types of labels. From the user's perspective, it is important to know where to put which label. By adding such information to the grammar, no additional analysis of the specification is necessary, everything is clear already by reading the grammar. Additionally, each of these different labels is included as a separate item in the tree. label is more general but it's also more informative than just STRING. But which is more important - better readability or shorter grammar? Which way is more common?

Upvotes: 0

Views: 168

Answers (2)

rici
rici

Reputation: 241911

The point of a grammar (in most cases) is to build a semantic structure (usually a parse tree) from an input. The precise requirements on that structure are highly dependent on the application, but in general terms the names used as syntactic variables (terminals and non-terminals) are about as important algorithmically as are the names of variables in a program: that is, not at all. Good naming practices may help future readers (including you!) better understand the intent of the grammar, but there is no "correct" or "incorrect" answer.

Personally, I usually try to avoid unnecessary unit productions, since they tend to clutter up the parse tree. (You could just discard the unit production when building the parse tree, but then why would you have it in the first place?)

That doesn't mean that it never makes sense to have two different non-terminals with exactly the same derivations. That can be used to apply different semantic actions. For example, it might be useful to mark the use of an identifier as "define" or "use", and that could be done in the grammar by using two different non-terminals whose right-hand sides are identical. On the other hand, it's likely that (at least over time) the two right-hand sides will diverge. For example, a subscript expression can be an "lvalue" while a function call cannot be (at least, in some languages). And on yet a third hand, prematurely marking a production with the expected use case can lead to parsing conflicts, so it's almost always best to at least delay differentiation as late as possible.

With reference to the concrete examples in your question, I'd say that using STRING is far superior to either a_label or b_label (which add no useful information at all). Using label with a unit production would make sense if there was some specific semantic conversion necessary to make a STRING into a label (or some specific semantic check). If not, both options are totally valid, but I'd go for STRING.

Upvotes: 0

Mike Cargal
Mike Cargal

Reputation: 6785

As you've probably guessed, this is going to be very much an "it depends" sort of answer.

Some factors to keep in mind:

rules like a_label, b_label, and label will create deeper parse trees (the parser will create new nodes for each rule as is descends). It's a good idea to use either a plug-in or grun -gui ... to get a visualization of your parse tree, and test out input against grammar variations. You'll be writing code to process that parse tree, so getting it in the most convenient structure for your code is helpful.

Just a note here that, once you understand listeners and visitors well, these intermediate nodes/levels in your tree may not be a bothersome as you suspect. I still prefer to keep the resulting tree as simple as possible. You'll also find a LOT of people of the opinion that you should convert from a parse tree to an AST as early as possible and do your work from there (I think that listeners and visitors can take a lot of parse tree pain away, once you understand how they work).

There are also situations where you'll find utility in having that extra level. Without getting into too much detail, if you're trying to do code completion, you may need to know whether an ID is a destination or a source ID:

assign: ID '=' ID;

works fine, and gives you a very clean parse tree

assign: assignDest '=' value;
assignDest: ID;
value: ID | NUMBER | STRING;

will give you a deeper parse tree, but if you're using something like the excellent ANTLR Code completion Core, you'll be able to tell it to return assignDest and value to you instead of just ID. So, in this case, there's definite value in the extra detail.

Opinion follows --

I think many people fall prey to ANTLRs appearance of documenting a languages syntax, and attempt to put "all the things" in the grammar. I find I get much farther by having "just enough" grammar to have ANTLR provide a parse tree that accurately represents the structure of the source input, and then use listeners an/or visitors to process from there. (Occasionally, adding a few things like the extra level of rules to make code completion work better.).

Start as simple as you can to get an unambiguous parse tree and add from there.

Upvotes: 0

Related Questions