brandizzi
brandizzi

Reputation: 27090

Validating a "break" statement with a recursive descent parser

In Crafting Interpreters, we implement a little programming language using a recursive descent parser. Among many other things, it has these statements:

statement → exprStmt
          | ifStmt
          | printStmt
          | whileStmt
          | block ;

block     → "{" declaration* "}" ;
whileStmt → "while" "(" expression ")" statement ;
ifStmt    → "if" "(" expression ")" statement ( "else" statement )? ;

One of the exercises is to add a break statement to the language. Also, it should be a syntax error to have this statement outside a loop. Naturally, it can appear inside other blocks, if statements etc. if those are inside a loop.

My first approach was to create a new rule, whileBody, to accept break:

## FIRST TRY
statement → exprStmt
          | ifStmt
          | printStmt
          | whileStmt
          | block ;

block     → "{" declaration* "}" ;
whileStmt → "while" "(" expression ")" whileBody ;
whileBody → statement
          | break ;
break     →  "break" ";" ;
ifStmt    → "if" "(" expression ")" statement ( "else" statement )? ;  

But we have to accept break inside nested loops, if conditionals etc. What I could imagine is, I'd need a new rule for blocks and conditionals which accept break:

## SECOND TRY
statement → exprStmt
          | ifStmt
          | printStmt
          | whileStmt
          | block ;

block     → "{" declaration* "}" ;
whileStmt → "while" "(" expression ")" whileBody ;
whileBody → statement
          | break
          | whileBlock
          | whileIfStmt
whileBlock→  "{" (declaration | break)* "}" ;
whileIfStmt    → "if" "(" expression ")" whileBody ( "else" whileBody )? ;  
break     →  "break" ";"
ifStmt    → "if" "(" expression ")" statement ( "else" statement )? ;  

It is not infeasible for now, but it can be cumbersome to handle it once the language grows. It is boring and error-prone to write even today!

I looked for inspiration in C and Java BNF specifications. Apparently, none of those specifications prohibit the break outside loop. I guess their parsers have ad hoc code to prevent that. So, I followed suit and added code into the parser to prevent break outside loops.

TL;DR

My questions are:

  1. Would the approach of my second try even work? In other words, could a recursive descent parser handle a break statement that only appears inside loops?
  2. Is there a more practical way to bake the break command inside a syntax specification?
  3. Or the standard way is indeed to change a parser to prevent breaks outside loops while parsing?

Upvotes: 0

Views: 568

Answers (2)

C. M. Sperberg-McQueen
C. M. Sperberg-McQueen

Reputation: 25054

Attribute grammars are good at this sort of thing. Define an inherited attribute (I'll call it LC for loop count). The 'program' non-terminal passes LC = 0 to its children; loops pass LC = $LC + 1 to their children; all other constructs pass LC = $LC to their children. Make the rule for 'break' syntactically valid only if $LC > 0.

There is no standard syntax for attribute grammars, or for using attribute values in guards (as I am suggesting for 'break'), but using Prolog definite-clause grammar notation your grammar might look something like the following. I have added a few notes on DCG notation, in case it's been too long since you have used them.

/* nt(X) means, roughly, pass the value X as an inherited attribute. 
** In a recursive-descent system, it can be passed as a parameter.
** N.B. in definite-clause grammars, semicolon separates alternatives,
** and full stop ends a rule.  
*/

/* DCD doesn't have regular-right-part rules, so we have to  
** handle repetition via recursion.
*/ 
program -->
    statement(0);
    statement(0), program.

statement(LC) -->
    exprStmt(LC);
    ifStmt(LC);
    printStmt(LC);
    whileStmt(LC);
    block(LC);
    break(LC).

block(LC) -->
    "{", star-declaration(LC), "}".

/* The notation [] denotes the empty list, and matches zero
** tokens in the input.  
*/
star-declaration(LC) -->
    [];
    declaration(LC), star-declaration(LC).

/* On the RHS of a rule, braces { ... } contain Prolog code.  Here,  
** the code "LC2 is LC + 1" adds 1 to LC and binds LC2 to that value.
*/ 
whileStmt(LC) -->
    { LC2 is LC + 1 }, "while", "(", expression(LC2), ")", statement(LC2).

ifStmt(LC) --> "if", "(", expression(LC), ")", statement(LC), opt-else(LC).

opt-else(LC) -->
    "else", statement(LC);
    [].

/* The definition of break checks the value of the loop count:
** "LC > 0" succeeds if LC is greater than zero, and allows the
** parse to succeed.  If LC is not greater than zero, the expression
** fails.  And since there is no other rule for 'break', any attempt
** to parse a 'break' rule when LC = 0 will fail.
*/
break(LC) --> { LC > 0 }, "break", ";".

Nice introductions to attribute grammars can be found in Grune and Jacobs, Parsing Techniques and in the Springer volumes Lecture Notes in Computer Science 461 (Attribute Grammars and Their Applications*, ed. P. Deransart and M. Jourdan) and 545 (Attribute Grammars, Applications, and Systems, ed. H. Alblas and B. Melichar.

The technique of duplicating some productions in order to distinguish two situations (am I in a loop? or not?), as illustrated in the answer by @rici, can be regarded as a way of pushing a Boolean attribute into the non-terminal names.

Upvotes: 2

rici
rici

Reputation: 241909

  1. Would the approach of my second try even work? In other words, could a recursive descent parser handle a break statement that only appears inside loops?

Sure. But you need a lot of duplication. Since while is not the only loop construct, I've used a different way of describing the alternatives, which consists of adding _B to the name of non-terminals which might include break statements.

declaration    → varDecl
               | statement
declaration_B  → varDecl
               | statement_B
statement      → exprStmt
               | ifStmt
               | printStmt
               | whileStmt
               | block
statement_B    → exprStmt
               | printStmt
               | whileStmt
               | breakStmt
               | ifStmt_B
               | block_B
breakStmt      → "break" ";"
ifStmt         → "if" "(" expression ")" statement ( "else" statement )?
ifStmt_B       → "if" "(" expression ")" statement_B ( "else" statement_B )?
whileStmt      → "while" "(" expression ")" statement_B ;
block          → "{" declaration* "}"
block_B        → "{" declaration_B* "}"

Not all statement types need to be duplicated. Non-compound statement like exprStmt don't, because they cannot possibly include a break statement (or any other statement type). And the statement which is the target of a loop statement like whileStmt can always include break, regardless of whether the while was inside a loop or not.

  1. Is there a more practical way to bake the break command inside a syntax specification?

Not unless your syntax specification has marker macros, like the specification used to describe ECMAScript.

  1. Is there a different way to do this?

Since this is a top-down (recursive descent) parser, it's pretty straight-forward to handle this condition in the parser's execution. You just need to add an argument to every (or many) parsing functions which specifies whether a break is possible or not. Any parsing function called by whileStmt would set that argument to True (or an enumeration indicating that break is possible), while other statement types would just pass the parameter through, and the top-level parsing function would set the argument to False. The breakStmt implementation would just return failure if it is called with False.

Upvotes: 1

Related Questions