Reputation: 552
I am trying to write a LBNF/BNFC grammar for a C-like language. In C there are many possible modifiers that you may or may not write in front of a declaration (like inline
, const
, volatile
and so on).
I am trying to write my grammar to reuse code and make the resulting Haskell AST easy to consume. A grammar for types could look like this:
rule TypeName ::= "bool" | "int" | "double" | "void" | Id ;
Type. Type ::= TypeQualifier TypeName;
ConstModifier. TypeModifier ::= "const" ;
VolatileModifier. TypeModifier ::= "volatile" ;
NoModifier. TypeModifier ::= ;
And for a function declaration it might look like this:
Fun. Fun ::= FunModifier Type Id "(" [Param] ")" ";" ;
InlineModifier. FunModifier ::= "inline" ;
NoFunModifier. FunModifier ::= ;
The problem with this is I'm getting a ton of shift/reduce and sometimes even reduce/reduce conflicts due to these optional prefixes. An alternative grammar that avoids these conflicts could look like this:
NotInlinedFun. Fun ::= Type Id "(" [Param] ")" ";" ;
InlinedFun. Fun ::= "inline" Type Id "(" [Param] ")" ";" ;
or
NotInlinedFun. Fun ::= FunRest
InlinedFun. Fun ::= "inline" FunRest;
FunRest. FunRest ::= Type Id "(" [Param] ")" ";" ;
which leads to a Haskell AST like this:
data Fun = AFun FunRest | BFun FunRest | CFun FunRest
data FunRest = FunRest Type Id [Param]
instead of the more appealing
data Fun = Fun Modifier Type Id [Param]
data Modifier = A | B | C
You can see how this could quickly lead to a combinatorial explosion of rules or a Haskell AST that won't be pleasant to use.
How can I best avoid these conflicts?
Upvotes: 0
Views: 432
Reputation: 241671
When you see nothing before an int
, you don't know whether that nothing is the absence of a variable modifier or the absence of a function modifier, precisely because you don't yet know whether the int
refers to a variable or the return value of a function. So if the parser is working only with a single token of lookahead, you must avoid forcing it to make a decision.
Fabricating a non-terminal out of nothing is a form of forcing the parser to decide what kind of nothing is being examined, so that, too, must be avoided. But that's not the only example; had you included static
, for example, you would find that trying to classify it as a variable modifier or a function modifier would lead to the same (reduce/reduce) conflict.
But in any case, the real C grammar is more subtle. For example, the following is legal:
static inline const int* extract(int arg);
And so is this:
/* The second const is irrelevant to this discussion. */
volatile const unsigned char* const reg = 0x01A4;
So a declaration can have a lot of qualifiers, not just zero or one. In some cases, repetition makes a difference:
long long very_wide;
In other cases, it doesn't:
inline inline int f(void);
While these constraints could be expressed as a context-free grammar, I've never seen it done; as you say, the exponential explosion is unmanageable. The actual C grammar, as described in the C standard, does not attempt this feat; it simply allows a declaration to contain an arbitrarily order of possibly repeated declaration-specifiers (see §6.7) and then forces the semantic analysis to distinguish between correct and incorrect sequences.
Upvotes: 1