Matt Bierner

Reputation: 65543

Parser Combinators: Handling Whitespace In Parser without excessive Backtracking

I am moving from a separate lexer and parser to a combined parser that operates on an array of characters. One problem is how to properly handle whitespace.

Problem

Take a language consisting of a sequence of characters 'a' and 'b'. Whitespace is allowed in the input but does not affect the meaning of the program.

My current approach to parse such a language is:

var token = function(p) {
    return attempt(next(
         parse.many(whitespace),
         p));
};

var a = token(char('a'));
var b = token(char('b'));

var prog = many(either(a, b));

This works, but requires unnecessary backtracking. For a program such as '___ba_a_b' (whitespace was getting stripped in the post, so let _ be a space), when matching 'b' the whitespace is parsed twice: first for a, and again for b after a fails. Simply removing attempt does not work, as either will never reach b if any whitespace has been consumed.
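To make the double parse concrete, here is a minimal hand-rolled sketch. This is not the actual library the question uses; the combinator names just mirror the ones above, and whitespace is instrumented with a counter. For the input '___b', the three leading spaces are consumed while trying a, thrown away when char('a') fails, and consumed all over again for b.

```javascript
// A parser here is a function
// (input, pos) -> {ok: true, value, pos} | {ok: false, pos}

function char(c) {
    return (input, pos) =>
        input[pos] === c ? { ok: true, value: c, pos: pos + 1 }
                         : { ok: false, pos };
}

function next(p, q) {
    // run p, then q; keep q's result
    return (input, pos) => {
        const r = p(input, pos);
        return r.ok ? q(input, r.pos) : r;
    };
}

function attempt(p) {
    // on failure, rewind to the starting position
    return (input, pos) => {
        const r = p(input, pos);
        return r.ok ? r : { ok: false, pos };
    };
}

function either(p, q) {
    // try q only if p failed without consuming input
    return (input, pos) => {
        const r = p(input, pos);
        return r.ok || r.pos !== pos ? r : q(input, pos);
    };
}

function many(p) {
    return (input, pos) => {
        const values = [];
        let r;
        while ((r = attempt(p)(input, pos)).ok) {
            values.push(r.value);
            pos = r.pos;
        }
        return { ok: true, value: values, pos };
    };
}

let whitespaceRuns = 0;
const whitespace = (input, pos) => {
    whitespaceRuns += 1;            // count every attempt to match a space
    return char(' ')(input, pos);
};

// the question's token: munch leading whitespace, wrap it all in attempt
const token = p => attempt(next(many(whitespace), p));

const a = token(char('a'));
const b = token(char('b'));
const prog = many(either(a, b));

const result = prog('   b', 0);
console.log(result.value.join(''));  // 'b'
console.log(whitespaceRuns);         // 10: each leading space is matched
                                     // once for a, then again for b
```

The count of 10 breaks down as four whitespace attempts for a (three matches plus the failure on 'b'), four more for b, and one each for a and b on the final empty iteration of many.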

Attempts

My first thought was to move the token call outside of the either:

var a = char('a');
var b = char('b');

var prog = many(token(either(a, b)));

This works, but now prog cannot be reused easily. In building a parser library, this seems to require defining parsers twice: one version that actually consumes 'a' or 'b' and can be used in other parsers, and one version that correctly handles whitespace. It also clutters up parser definitions by requiring them to have explicit knowledge of how each parser operates and how it handles whitespace.

Question

The intended behavior is that an arbitrary amount of whitespace can be consumed before a token. If parsing the token fails, the parser should backtrack to the start of the token instead of to the start of the whitespace.

How can this be expressed without preprocessing the input to produce a token stream? Are there any good, real world code examples of this? The closest I found was Higher-Order Functions for Parsing but this does not seem to address my specific concern.

Upvotes: 3

Views: 1445

Answers (1)

Matt Fenwick

Reputation: 49105

I solved this problem in a JSON parser that I built. Here are the key parts of my solution, in which I followed the 'write the parsers twice' approach somewhat:

  1. define the basic parsers for each token -- number, string, etc.

  2. define a token combinator that takes a basic token parser and returns a whitespace-munching parser. The munching should occur after the token, so that whitespace is only parsed once, as you noted:

    function token(parser) {
        // run `parser`, then consume ("munch") any trailing whitespace,
        // keeping `parser`'s result -- here seq2L keeps the left result
        // and munch is many(whitespace)
        return seq2L(parser, munch);
    }
    
  3. use the token combinator to produce the munching token parsers

  4. use the parsers from 3. to build the parsers for the rest of the language
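The four steps above can be sketched end-to-end with hand-rolled combinators. This is illustrative only, not the answerer's actual JSON parser; `seq2L` (keep the left result) and `munch` are assumed names:

```javascript
// A parser is a function
// (input, pos) -> {ok: true, value, pos} | {ok: false, pos}

function char(c) {
    return (input, pos) =>
        input[pos] === c ? { ok: true, value: c, pos: pos + 1 }
                         : { ok: false, pos };
}

function next(p, q) {
    // run p, then q; keep q's result
    return (input, pos) => {
        const r = p(input, pos);
        return r.ok ? q(input, r.pos) : r;
    };
}

function seq2L(p, q) {
    // run p, then q; keep p's result
    return (input, pos) => {
        const r = p(input, pos);
        if (!r.ok) return r;
        const s = q(input, r.pos);
        return s.ok ? { ok: true, value: r.value, pos: s.pos } : s;
    };
}

function either(p, q) {
    // try q only if p failed without consuming input
    return (input, pos) => {
        const r = p(input, pos);
        return r.ok || r.pos !== pos ? r : q(input, pos);
    };
}

function many(p) {
    return (input, pos) => {
        const values = [];
        let r;
        while ((r = p(input, pos)).ok) {
            values.push(r.value);
            pos = r.pos;
        }
        return { ok: true, value: values, pos };
    };
}

// step 1: basic parsers; step 2: the trailing-whitespace token combinator
const munch = many(char(' '));
const token = parser => seq2L(parser, munch);

// step 3: munching token parsers
const a = token(char('a'));
const b = token(char('b'));

// step 4: munch leading whitespace once at the start, then parse tokens.
// No attempt is needed: a failing token consumes nothing, so either can
// move on to its alternative, and each run of whitespace is parsed once.
const prog = next(munch, many(either(a, b)));

const parsed = prog('   ba ab', 0);
console.log(parsed.value.join(''));  // 'baab'
```

Note how the backtracking problem disappears: char('a') fails before consuming anything, so either falls through to b immediately, and the whitespace was already consumed by the previous token's munch.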

I didn't have a problem with having two similar versions of the parsers, since each version was in fact distinct. It was slightly annoying, but a tiny cost. You can check out the full code here.

Upvotes: 0

Related Questions