Reputation: 7821
As noted in the Wikipedia article on Parsing, the process has three stages:
Other than the small confusion in stage (3) above, I wanted to verify that my understanding of the process is correct for ECMAScript.
Thus, is the below flow correct?
Upvotes: 0
Views: 263
Reputation: 241771
The syntactic parsing does not obey the "maximal munch" rule (select the longest matching prefix). In fact, as far as I know ECMA-262 does not specify a parsing algorithm, but does provide an unambiguous context-free grammar which can be parsed, for example with a bottom-up (LR(k)) parser, aside from some issues dealing with automatic semicolon insertion and some restrictions on productions which span a newline (which is not a syntactic token).
However, as mentioned in §5.1.4, the grammar actually recognises a superset of the language; additional restrictions are provided in the form of supplementary grammars.
One clarification: The complexities related to having multiple context-dependent lexical goal symbols make it difficult to first divide the input into lexemes and only then combine the lexemes into a parse tree. It is impossible to know the correct lexical goal symbol at each point without at least a partial parse, so it is convenient to interleave the syntactic and lexical parses. Practical parsing algorithms operate from left to right, processing lexemes basically in input order, so it is possible to do lexical analysis on demand, only finding a lexeme when the parser needs more input to continue.
But aside from that, the overall structure you outline is correct. In the lexical parse, the longest possible prefix of terminals (characters) are aggregated into a non-terminals to create a lexeme (according to slightly complicated rules about which lexical goal is required); in the syntactic parse, terminals (lexemes) are aggregated into non-terminals to produce a single parse tree corresponding to one of two syntactic goal symbols.
As is often the case with real-world languages, the reality is not quite as clean as that. Aside from the need for the parser to indicate which lexical goal is required, there are also the newline rules and automatic semicolon insertion, both of which cross the boundary between lexical and syntactic parsing.
The use of the words "terminal" and "non-terminal" can be a bit confusing, but I (and the ECMA standard) use them with the standard meaning in a context-free grammar.
A context-free grammar consists of productions, each of which has the form:
N ⇒ S …
where N
is a non-terminal symbol and S
is a possibly-empty sequence of either terminal or non-terminal symbols. Terminal symbols are atoms in the representation of the string to be recognized.
The standard parsing model divides the parse into two levels: lexical and syntactic. The original input is a sequence of characters; lexical analysis turns this into a sequence of lexemes, which are the input to the syntactic parse.
A standard context-free grammar has a single goal symbol, which is one of the non-terminals defined by the grammar. The parse succeeds if the entire input can be reduced to this non-terminal.
A lexical scan can be viewed as a context-free grammar with an ordered list of goal symbols. It tries each goal symbol in turn on successively longer prefixes of the input, and accepts the first goal symbol which matched the longest prefix. (in practice, this is all done in parallel; I'm talking conceptually here.) When ECMA-262 talks about different lexical goals, it really means different lists of possible goal non-terminals.
It's also useful to augment symbols with semantic attributes; these attributes do not influence the parse, but they are useful once the parse is done. In particular, the parse tree is built by attaching a tree node as an attribute to each non-terminal created from a production during the parse, so that the final result of the parse is not the non-terminal symbol as such (that's known before the parse starts) but rather the semantic attributes attached to that particular instance of a non-terminal, while the result of the lexical scan at each point is a non-terminal symbol and its associated semantic attributes; typical, the semantic attribute will be the associated input sequence, or some function of those characters.
In any event, the two-level parse involves feeding the output non-terminals of the lexical level as terminals for the syntactic level.
Upvotes: 4