Can a lexical grammar be defined using productions?

Question

I'm watching a course on compilers that uses regexps to define a grammar for lexical analyzer (lexer) which uses this regexp to configure finite automaton. Only when the course gets to the syntax analyzer (parser) section the terms specific to context-free grammar like productions, terminals and non-terminals are introduced. Which leads to a conclusion that these are only used during syntax analysis.

However, the EcmaScript spec defines productions for the lexer:

Productions of the lexical and RegExp grammars are distinguished by having two colons “::” as separating punctuation. The lexical and RegExp grammars share some productions.

So can the grammar for the lexer also be specified using productions with terminal and non-terminal symbols and not solely regexps?

This grammar tutorial also mentions that context-free-grammar (CFG) is a superset of Regular Grammar so it seems that the answer to my question is "yes". I'd appreciate a little bit of explanation. Thanks in advance

torek · Accepted Answer

This has already been answered in comments, but unless it's a duplicate, it probably deserves an answer.

As in your own link, yes, any parser that handles a context-free grammar can also handle a regular grammar. Context-free grammars are those recognized by a pushdown automaton (PDA), which is a strict superset¹ of those recognized by a finite state machine (FSM). There is a good chapter on this here, for instance.

(Note that a regular language, which I will call "RL" here although that's not a common abbreviation, is one that can be recognized by an FSM. A context free language or CFL is one that can be recognized by a PDA. Meanwhile a regular grammar, or RG, is one that recognizes regular languages, and a context-free grammar, or CFG, is one that recognizes context-free languages. CFL, CFG, PDA, and FSM are all common and standard abbreviations; it's the RL and RG abbreviations I am using here that are not. We also have RE or R.E., standing for Regular Expression, which is how we write our regular language—our RL—that we use to build the FSM that recognizes "words" in this RL.)

The reason we generally come up with some RL and write a small regular-expression-based FSM recognizer for this to turn input into tokens, then parse the tokens with a context-free grammar, is that this makes for smaller, simpler code construction. Compilers that do not do extensive optimization tend to spend much of their time in the tokenizing code (though obviously this varies from language to language: e.g., some languages have complex type systems that can cause them to spend most of their time just doing type matching), so to the extent that we can make this small and fast, we come out ahead.

As Bergi notes, the specific issue you're looking at in ECMAScript is the syntax by which the source language expresses a regular expression. (Is that sentence too convoluted? :-) ) While a regular expression is something that can be recognized by an FSM (rather than a PDA), the syntax we use to write the regular expression need not itself be an RL—and in ECMAScript, it isn't.

In fact, using a mix of RL and CFL to encode REs is pretty typical. For instance, in C, C++, and Python, RE handling is in a library, and we send the library routine a string, which it "compiles" into some FSM. The string itself encodes the RE in a CFL, so the (runtime) compilation into an FSM can fail. Some other languages, such as awk, raise the RE syntax into the grammar, but come up with a surrounding regular language in which to express it—such as /pattern/ with any embedded slashes within the pattern being encoded as \/. By encoding the RE in an RL, these languages allow their FSM-based tokenizers to turn the entire RE into a single token, without using a CFG. The token can then be compiled at compile time rather than run time (using a separate subroutine that takes the token apart again), and any compile-time grammar violation can be produced at awk "compile" time (which is runtime anyway).

ECMAScript simply chooses to be different. Its language for its REs is not an RL, but is a CFL, so only a PDA, not an FSM, can recognize it. Whether it could be handled in an awk-like fashion, I'm not sure: the CFL part is defined in section 21.2.1, and there is a surrounding RL in section 11.8.5. As Bergi notes again, the embedded CFL requires a PDA / CFG (e.g., the syntax in 21.2.1 has balanced parentheses: a disjunction embeds additional disjunctions, and requires a closing parenthesis afterward).

(Whew! Too many TLAs!² :-) )

¹This strict superset means there are many CFLs that cannot be recognized by an FSM. A classic example is that FSMs cannot balance an arbitrary number of parentheses, while PDAs can. If you pick some upper limit, such as "no more than 100 nested parentheses", you can generate a (rather alarmingly large) FSM that recognizes all inputs that have no more than 100 nested parentheses, yet nonetheless have balanced parentheses.

²TLA: Three (or occasionally Two) Letter Acronym. See also ETLA, Extended Three Letter Acronym.

Can a lexical grammar be defined using productions?

Answers (1)

Related Questions