Mr. Frobenius

Reputation: 324

Understanding and Writing Parsers

I'm writing a program that requires me to create my first real, somewhat complicated parser. I would like to understand what parsing algorithms exist, as well as how to create a "grammar". So my questions are as follows:

1) How does one create a formal grammar that a parser can understand? What are the basic components of a grammar?

2) What parsing algorithms exist, and what kind of input does each excel at parsing?

3) In light of the broad nature of the questions above, what are some good references I can read through to understand the answers to questions 1 and 2?

I'm looking for more of a broad overview with the keywords/topic areas I need so I can look into the details myself. Thanks everybody!

Upvotes: 0

Views: 95

Answers (1)

blazs

Reputation: 4845

You generally start by writing a context-free grammar G that describes a certain formal language L, which is simply a set of strings over a certain alphabet (think of all well-formed C programs, all well-formed HTML documents, or all well-formed Markdown posts; each of these is a set of finite strings over some subset of the ASCII character set). After that you come up with a parser for the given grammar, that is, an algorithm that, given a string w, decides whether w can be derived by the grammar G. (For example, the grammar of the C11 language describes the set of all well-formed C programs.)

Some types of grammars admit simple-to-implement parsers. One class of grammars often used in practice is the class of LL grammars. A special subset of LL grammars, the LL(1) grammars, admits parsers that run in linear time (linear in the length of the string being parsed).

There are more general parsing algorithms, most notably the Earley parser and the CYK algorithm, that take as input a string w and a grammar G and decide in time O(|w|^3) whether the string w is derivable by the grammar G. (Notice how cool this is: the algorithm takes the grammar as an argument. But I don't think this is used in practice.)
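To make that concrete, here is a minimal sketch of a CYK recognizer in Java, not part of the original answer. It assumes the grammar has already been converted to Chomsky normal form (only rules of the form A -> B C and A -> a), and the hard-coded example grammar for { a^n b^n : n >= 1 } as well as the class and method names are mine, for illustration only:

    import java.util.*;

    public class CYK {
        // Binary rules A -> B C, keyed by "B C" -> set of left-hand sides A.
        static Map<String, Set<String>> binaryRules = new HashMap<>();
        // Terminal rules A -> a, keyed by the terminal a -> set of left-hand sides A.
        static Map<Character, Set<String>> terminalRules = new HashMap<>();

        static void addBinary(String lhs, String b, String c) {
            binaryRules.computeIfAbsent(b + " " + c, k -> new HashSet<>()).add(lhs);
        }

        static void addTerminal(String lhs, char a) {
            terminalRules.computeIfAbsent(a, k -> new HashSet<>()).add(lhs);
        }

        // Decides whether w is derivable from the start symbol, filling an
        // O(|w|^2)-entry table with an O(|w|) amount of work per entry.
        static boolean recognize(String w, String start) {
            int n = w.length();
            if (n == 0) return false; // this CNF grammar cannot derive the empty string
            // table[len][i] = nonterminals deriving w.substring(i, i + len + 1)
            Set<String>[][] table = new HashSet[n][n];
            for (int len = 0; len < n; len++)
                for (int i = 0; i + len < n; i++)
                    table[len][i] = new HashSet<>();
            for (int i = 0; i < n; i++)
                table[0][i].addAll(terminalRules.getOrDefault(w.charAt(i), Set.of()));
            for (int len = 1; len < n; len++)
                for (int i = 0; i + len < n; i++)
                    for (int k = 0; k < len; k++)                 // split point
                        for (String b : table[k][i])
                            for (String c : table[len - k - 1][i + k + 1]) {
                                Set<String> lhs = binaryRules.get(b + " " + c);
                                if (lhs != null) table[len][i].addAll(lhs);
                            }
            return table[n - 1][0].contains(start);
        }

        public static void main(String[] args) {
            // Example CNF grammar for { a^n b^n : n >= 1 }:
            //   S -> A B | A T,   T -> S B,   A -> a,   B -> b
            addBinary("S", "A", "B");
            addBinary("S", "A", "T");
            addBinary("T", "S", "B");
            addTerminal("A", 'a');
            addTerminal("B", 'b');
            System.out.println(recognize("aabb", "S"));  // true
            System.out.println(recognize("aab", "S"));   // false
        }
    }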

I implemented the Earley parser in Java some time ago. If you're interested, the code is available on GitHub.

For a concrete example of the whole process, consider the language of all balanced strings of parentheses: (), (()), ((()))()(())(), etc. We can describe it with the following context-free grammar:

S -> (S) | SS | eps

where eps denotes the empty string. For example, we can derive the string (())() as follows: S => SS => (S)S => ((S))S => (())S => (())(S) => (())(). We can easily implement a parser for this grammar (left as exercise :-); a rough sketch follows below.
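Here is one way to do it in Java, as a hedged sketch rather than the only answer to the exercise. The grammar S -> (S) | SS | eps is ambiguous, so the sketch first rewrites it into the equivalent LL(1) form S -> ( S ) S | eps and then implements a recursive-descent recognizer with one method per nonterminal (the class and method names are made up for illustration):

    // Recursive-descent recognizer for balanced parentheses,
    // using the equivalent LL(1) grammar  S -> '(' S ')' S | eps.
    public class ParenParser {
        private final String input;
        private int pos = 0;

        private ParenParser(String input) { this.input = input; }

        private char peek() { return pos < input.length() ? input.charAt(pos) : '$'; }

        // S -> '(' S ')' S | eps
        // One character of lookahead decides which production to use:
        // on '(' take the first alternative, otherwise derive eps.
        private boolean parseS() {
            if (peek() == '(') {
                pos++;                        // consume '('
                if (!parseS()) return false;  // inner S
                if (peek() != ')') return false;
                pos++;                        // consume ')'
                return parseS();              // trailing S
            }
            return true;                      // eps: match the empty string
        }

        public static boolean accepts(String w) {
            ParenParser p = new ParenParser(w);
            return p.parseS() && p.pos == w.length(); // must consume all input
        }

        public static void main(String[] args) {
            System.out.println(accepts("(())()"));   // true
            System.out.println(accepts("((())"));    // false
        }
    }

Because a single character of lookahead picks the production, this is exactly the kind of linear-time LL(1) parser mentioned above.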

A very good reference is the so-called dragon book: Compilers: Principles, Techniques, and Tools by Aho et al. It covers all the essential topics. Another good reference is the classic book Introduction to Automata Theory, Languages, and Computation by Hopcroft et al.

Upvotes: 2
