Geier
Geier

Reputation: 904

Separation of tasks between lexer and parser in parsing regular expressions

I am somewhat confused about the separation of tasks between a lexer and a parser.

I'm trying to write a parser which takes Perl-style regular expression and builds a syntax tree. My problem is recognizing quantifiers such as {n,m}, which means that the preceeding group or character or character class should occur at least n, but not more than m times.

The point is that a incomplete/invalid quantifier such as {2,5asdf} is not a quantifier, but a group of regular characters.

The question is: Given the input /a{2,5}/, should a lexer return a list of tokes such as DELIMITER CHARACTER QUANTIFIER_START NUMBER COMMA NUMBER QUANTIFIER_END DELIMITER END (the problem being that the QUANTIFIER_START may not be a "real" start of a quantifier, depending on what follows), or should it try to match the complete quantifier and just return QUANTIFIER, which intuitively sounds more like a task for a parser?

Upvotes: 1

Views: 77

Answers (1)

Bart Kiers
Bart Kiers

Reputation: 170158

Using a tool where the lexer and parser are separated, there's usually little room for you to change tokens during lexing. The lexer usually operates independently from the parser, and making lexing context sensitive is, if possible, hacky (you might want to Google for PEG, or scannerless parsing, where there's no real separation between lexing and parsing).

However, it all depends on the tool you're using. I've create a PCRE parser using ANTLR that makes use of backtracking if a parse fails. So, if after parsing {2,5a it fails to construct a quantifier (the a is invalid), the parser will backtrack to "{" and make a LITERAL token from this char and will then continue. At the cost of a bit of RAM, I enabled memoization causing the parser to still perform well on large inputs.

It parses X{2,5asdf} as:

'- ALTERNATIVE
   |- ELEMENT
   |  '- LITERAL='X'
   |- ELEMENT
   |  '- LITERAL='{'
   |- ELEMENT
   |  '- LITERAL='2'
   |- ELEMENT
   |  '- LITERAL=','
   |- ELEMENT
   |  '- LITERAL='5'
   |- ELEMENT
   |  '- LITERAL='a'
   |- ELEMENT
   |  '- LITERAL='s'
   |- ELEMENT
   |  '- LITERAL='d'
   |- ELEMENT
   |  '- LITERAL='f'
   '- ELEMENT
      '- LITERAL='}'

and X{2,5} as:

'- ALTERNATIVE
   '- ELEMENT
      |- LITERAL='X'
      '- QUANTIFIER
         |- NUMBER='2'
         |- NUMBER='5'
         '- GREEDY

You can play around with the parser here: http://pcreparser.appspot.com/

The ANTLR grammar can be found here: https://github.com/bkiers/PCREParser/blob/master/src/grammar/PCRE.g

Upvotes: 1

Related Questions