Reputation: 54130
I have a parsing expression grammar that parses assignments which resemble natural language - part of the Rockstar programming language (https://codewithrockstar.com/), and I'm using PeggyJS (https://peggyjs.org/online.html) to test the grammar.
The following grammar works:
program = head:assign EOL tail:program
{ return [ head ].concat(tail) }
/ EOL { return [] }
assign = v:variable _ is _ n:number
{ return { assign: { target: v, number: n } } }
is = 'is'i / 'are'i
number = n:$([0-9]+) { return parseFloat(n) }
variable = common_variable / proper_variable
common_variable = name:$(common_prefix _ identifier)
{ return name }
common_prefix = 'a'i / 'the'i / 'my'i /'your'i
identifier = name:$([A-Za-z][A-Za-z0-9]+)
{ return name }
proper_variable = name:$(proper_noun (_ proper_noun)*)
{ return name }
proper_noun = !keyword name:$([A-Z][A-Za-z0-9]*)
{ return name }
keyword = (is / common_prefix) !letter
letter = [a-z]
_ = [ \t\n\r]*
EOL = '\r'?'\n'
This grammer can parse statements like:
Big Horses Are 10
My life is 15
The Sky Is 20
Your code is 9
The problem is the keyword
rule. If I reverse the order of the rules it references:
keyword = (common_prefix / is) !letter
the grammar stops working. I think is is because the 'a'
sequence defined by common_prefix
is a prefix of the 'are'
sequence defined by is
, and the !letter
lookeahead is outside the ()
group, so the parser finds the token are
, checks the !keyword
rule, finds the leading 'a'
, matches the rule, finds that the next character is a letter ('r'
), which causes the !letter
lookahead to fail, and so doesn't recognise that are
is a keyword.
This means adding rules to the keywords collection becomes incredibly error-prone; if any pattern in a rule is a prefix of a pattern defined in another rule, the grammar breaks in unpredictable ways.
I have been able to work around this behaviour by adding the !letter
lookahead to individual keywords within a rule:
keyword = ( 'a' !letter / 'are' !letter )
but that would mean adding every keyword individually, which would also be extremely repetitive and error-prone.
Is there a way to specify that a negative lookahead should apply to every pattern with the alternation expression to avoid the prefix-matching behaviour I'm seeing?
Upvotes: 0
Views: 77