Reputation: 123
I recently learned that C does not have a context-free grammar. I also recently learned that gcc used to use yacc to parse C. The manual for the yacc utility states "The class of specifications accepted [by yacc] is a very general one: LALR(1) grammars with disambiguating rules", while Wikipedia states that LALR grammars are a subset of deterministic context-free grammars, which are a subset of the context-free grammars. If C is not even context-free (much less a deterministic context-free language), and yet yacc can parse C, then what class of languages can yacc parse, if not the subset of context-free languages that have LALR(1) grammars?
Upvotes: 4
Views: 799
Reputation: 58578
C doesn't have a context-free grammar only for the trivial reason that the semantic classification of an identifier token (its lexical category) sometimes requires knowing how it was declared. C is designed for one-pass compilation, so that at any point in the parse, everything that is relevant to parsing is known from prior declarations. The declarations in scope can be used to assign a lexical category to a token.
So for instance, if the parser is facing `(A)(B)` in the middle of a statement block, this could be either:

- an expression `(B)` being cast to type `A`, or
- an argument list `(B)` being applied to a function expression `(A)`.

But this ambiguity doesn't have to arise in the parser, because the lexical analyzer can peek into the scope and classify `A` differently based on whether it is a `typedef` name or something else, and these differently classified identifiers can then be targeted by separate grammar rules. It's like having a magic oracle that tags tokens with semantic info, so then the context-free technique can be applied.
Another problem is that C has a preprocessor in the first place. The grammar of C is specified in separate pieces: there is a preprocessor grammar, and there is a grammar for the preprocessed token stream. There can't be a context-free grammar for C as such which captures the nuances of its phrase structure, because preprocessing can redefine the syntax, and macros can be called anywhere, except within comments and string literals.
Upvotes: 1
Reputation: 241721
Yacc generates computer programs, which are pretty well Turing complete. The yacc framework uses an LALR(1) state machine to trigger actions, but the actions are arbitrary code.
Moreover, the input to a yacc-generated parser is a stream of tokens, not raw text. The stream of tokens is produced by another computer program written in a Turing-complete language, which can also manipulate its input in ways not restricted to context-free transcoding.
Finally, nothing stops a yacc-generated parser from initially accepting a superset of the intended language and then later analysing the context-free parse tree and rejecting certain constructs based on arbitrary computations, such as insisting that variables be declared before use (a context-sensitive computation).
In short, real-world parsers are pragmatically written programs, not theoretical academic exercises. Languages parsed by bison/yacc are generally "mostly" LALR(1), and their lexical analysis is generally "mostly" regular, but don't be surprised when the computer programs exploit their full power to transcend these limits.
That's what makes programming the interesting activity it is.
None of that makes the academic theory less useful. Bison/yacc and other parser generators take a lot of the gruntwork out of building parsers, because they can handle "most of" the analysis. And the closer a language comes to the analysable context-free model, the easier it is to generate ("most of") other useful tools: linters, syntax highlighters, reformatters, indexers, static analysers, documentation extractors, etc., etc. To say nothing of functioning as documentation for the syntax of the language itself.
Upvotes: 5