Reputation: 6390
I'll try to explain my question with an example. Consider the following grammar production in the C++ standard:
literal:
integer-literal
character-literal
floating-point-literal
string-literal
boolean-literal
pointer-literal
user-defined-literal
Once the parser identifies a literal as an integer-literal, I always thought that the parser would just stop there. But I was told that this is not true. The parser will continue parsing to verify whether the literal could also be matched with a user-defined-literal, for example.
Is this correct?
Edit
I decided to include this edit as my interpretation of the Standard, in response to @rici's excellent answer below, although with a result that is the opposite of the one advocated by the OP.
One can read the following in [stmt.ambig]/1 and /3 (emphases are mine):
There is an ambiguity in the grammar involving expression-statements and declarations: An expression-statement with a function-style explicit type conversion as its leftmost subexpression can be indistinguishable from a declaration where the first declarator starts with a (. In those cases the statement is a declaration.
That is, this paragraph states how ambiguities in the grammar should be treated. There are several other ambiguities mentioned in the C++ Standard, but only three that I know are ambiguities related to the grammar, [stmt.ambig], [dcl.ambig.res]/1, a direct consequence of [stmt.ambig] and [expr.unary.op]/10, which explicitly states the term ambiguity in the grammar.
The disambiguation is purely syntactic; that is, the meaning of the names occurring in such a statement, beyond whether they are type-names or not, is not generally used in or changed by the disambiguation. Class templates are instantiated as necessary to determine if a qualified name is a type-name. Disambiguation precedes parsing, and a statement disambiguated as a declaration may be an ill-formed declaration. If, during parsing, a name in a template parameter is bound differently than it would be bound during a trial parse, the program is ill-formed. No diagnostic is required. [ Note: This can occur only when the name is declared earlier in the declaration. — end note ]
Well, if disambiguation precedes parsing there is nothing that could prevent a decent compiler to optimize parsing by just considering that the alternatives present in each definition of the grammar are indeed ordered. With that in mind, the first sentence in [lex.ext]/1 below could be eliminated.
If a token matches both user-defined-literal and another literal kind, it is treated as the latter. [ Example: 123_km is a user-defined-literal, but 12LL is an integer-literal. — end example ] The syntactic non-terminal preceding the ud-suffix in a user-defined-literal is taken to be the longest sequence of characters that could match that non-terminal.
Note also that this paragraph doesn't mention ambiguity in the grammar, which for me at least, is an indication that the ambiguity doesn't exist.
Upvotes: 2
Views: 267
Reputation: 241951
There is no implicit ordering of productions in the C++ presentation grammar.
There are ambiguities in that grammar, which are dealt with on a case-by-case basis by text in the standard. Note that the text of the the standard is normative; the grammar does not stand alone, and it does not override the text. The two need to be read together.
The standard itself points out that the grammar as resumed in Appendix A:
… is not an exact statement of the language. In particular, the grammar described here accepts a superset of valid C++ constructs. Disambiguation rules (8.9, 9.2, 11.8) must be applied to distinguish expressions from declarations. Further, access control, ambiguity, and type rules must be used to weed out syntactically valid but meaningless constructs. (Appendix A, paragraph 1)
That's not a complete list of the ambiguities resolved in the text of the standard, because there are also rules about lexical ambiguities. (See below.)
Almost all of these ambiguity resolution clauses are of the form "if both P and Q applies, choose Q", and thus would be unnecessary were there an implicit ordering of grammar alternatives, since the correct parse could be guaranteed simply by putting the alternatives in the correct order. So the fact that the standard feels the need to dedicate a number of clauses to ambiguity resolution is prima facie evidence that alternatives are not implicitly ordered. [Note 1]
The C++ standard does not explicitly name the grammar formalism being used, but it does credit the antecedents which allows us to construct a historical argument. The formalism used by the C++ standard was inherited from the C standard and the description in Kernighan & Ritchie's original book on the (then newly-minted) C language. K&R wrote their grammar using the Yacc parser generator, and the original C grammar is basically a Yacc grammar file. Yacc uses the LALR(1) algorithm to construct a parser from a context-free grammar (CFG), and its grammar files are a concrete representation of that grammar written in what has come to be known as BNF (although there is some historical ambiguity about what the letters in BNF actually stand for). BNF does not have any implicit ordering of rules, and the formalism does not allow any way to write an explicit ordering or any other disambiguation rule. (A BNF grammar must be unambiguous in order to be mechanically parsed; if it is ambiguous, the LALR(1) algorithm will fail to generate a parser.)
Yacc does go a bit outside of the box. It has some automatic disambiguation rules, and one mechanism to provide explicit disambiguation (operator precedence). But Yacc's disambiguation has nothing to do with the ordering of alternatives either.
In short, ordered alternatives were not really a feature of any grammar formalism until 2002 when Bryan Ford proposed packrat parsing, and subsequently formalised a class of grammars which he called "Parsing Expression Grammars" (PEGs). The PEG algorithm does implicitly order alternatives, by insisting that the right-hand alternative in an alternation only be attempted if the left-hand alternative failed to match. For this reason, the PEG alternation operator (or "ordered alternation" operator) is generally written as /
instead of |
, avoiding confusion with the traditional unordered alternation syntax.
A key feature of the PEG algorithm is that it is always deterministic. Every PEG grammar can be deterministically applied to a source text without ambiguity. (That doesn't mean that the grammar will give you the parse you wanted, of course. It just means that it will never give you a list of parses and let you select the one you want.) So grammars written in PEG cannot be accompanied by textual rules which disambiguate, because there are no ambiguities.
I mention this because the existence and popularity of PEG have to some extent altered the perception of the meaning of the alternation operator. Before PEG, we probably wouldn't be having this kind of discussion at all. But using PEG as a guide to interpreting the C++ grammar formalism is ahistoric and unjustifiable; the roots of the C++ grammar go back to at least 1978, at least a quarter of a century before PEG.
[lex.pptoken]
(§5.4) paragraph 3 lays down the fundamental rules for token recognition, which is a little more complicated than the traditional "maximal munch" principle which always recognises the longest possible token starting immediately after the previously recognised token. It includes two exceptions:
<::
is treated as starting with the token <
rather than the longer token <:
unless it is the start of <::>
(treated as <:
, :>
) or <:::
(treated as <:
, ::
). That might all make more sense if you mentally replace <:
with [
and :>
with ]
, which is the intended syntactic equivalence.[lex-header]
(§5.8) avoids the ambiguity between header-names and string-literals (as well as certain token sequences starting with <
) by requiring header-name to only be recognised in certain contexts, including an #include
preprocessing directive. (The section does not actually say that the string-literal should not be recognised, but I think that the implication is clear.)
[lex.ext]
(§5.13.8) paragraph 1 resolves the ambiguities involved with user-defined-literals, by requiring that:
Note that this rule is not really a tokenisation rule, because it is applied after the source text has been divided into tokens. Tokenisation is done in translation phase 3, after which the tokens are passed through the preprocessing directives (phase 4), rewriting of escape sequences and UCNs (phase 5), and concatenation of string literals (phase 6). Each token which emerges from phase 6 must then be reinterpreted as a token in the syntactic grammar, and it is at that point that literal tokens will be classified. So it's not necessary that §5.13.8 clarify what the extent of the token being categorised is; the extent is already known and the converted token must use exactly all of the characters in the preprocessing token. Thus it's quite different from the other ambiguities in this list, but I left it here because it is so present in the original question and in the various comment threads.
Upvotes: 3
Reputation: 263647
No ordering is either implied or necessary.
All seven kinds of literal are distinct. No token that meets the definition of any of them can meet the definition of any other. For example, 42
is an integer-literal and cannot be a floating-point-literal.
How a compiler determines what a token is is an implementation detail that the standard doesn't address, and doesn't need to.
If there were an ambiguity, so that for example the same token could be either an integer-literal or a user-defined-literal, either the language would have to have a rule to disambiguate it, or it would be a bug in the grammar.
UPDATE: There is in fact such an ambiguity. As discussed in comments, 42ULL
satisfies the syntax of either an integer-literal or a user-defined-literal. This ambiguity is resolved, not by the ordering of the grammar productions, but by an explicit statement:
If a token matches both user-defined-literal and another literal kind, it is treated as the latter.
Upvotes: 1
Reputation: 474476
The section on syntactic notation in the standard only says this about what it means:
In the syntax notation used in this document, syntactic categories are indicated by italic type, and literal words and characters in
constant width
type. Alternatives are listed on separate lines except in a few cases where a long set of alternatives is marked by the phrase “one of”. If the text of an alternative is too long to fit on a line, the text is continued on subsequent lines indented from the first one. An optional terminal or non-terminal symbol is indicated by the subscript “opt”, so{ expressionopt }
indicates an optional expression enclosed in braces.
Note that the statement considers the terms in grammars to be "alternatives", rather than a list or even an ordered list. There is no statement about ordering of the "alternatives" at all.
As such, this strongly suggests that there is no ordering at all.
Indeed, the presence throughout the standard of specific rules to disambiguate cases where multiple terms match also suggests that the alternatives are not written as a prioritized list. If the alternatives were some kind of ordered list, this statement would be redundant:
If a token matches both user-defined-literal and another literal kind, it is treated as the latter.
Upvotes: 0