Dylan Beattie
Dylan Beattie

Reputation: 54130

Can I prevent expressions matching prefixes of other alternate expressions in a parsing expression grammar?

I have a parsing expression grammar that parses assignments which resemble natural language - part of the Rockstar programming language (https://codewithrockstar.com/), and I'm using PeggyJS (https://peggyjs.org/online.html) to test the grammar.

The following grammar works:

program = head:assign EOL tail:program 
    { return [ head ].concat(tail) }
    / EOL { return [] }
    
assign = v:variable _ is  _ n:number
    { return { assign: { target: v, number: n } } }

is  = 'is'i / 'are'i

number = n:$([0-9]+) { return parseFloat(n) } 

variable = common_variable / proper_variable

common_variable = name:$(common_prefix _ identifier)
    { return name }

common_prefix = 'a'i / 'the'i / 'my'i /'your'i

identifier = name:$([A-Za-z][A-Za-z0-9]+)
    { return name }
    
proper_variable = name:$(proper_noun (_ proper_noun)*)
    { return name }

proper_noun = !keyword name:$([A-Z][A-Za-z0-9]*)
    { return name }

keyword = (is / common_prefix) !letter
letter = [a-z]

_ = [ \t\n\r]*
EOL = '\r'?'\n'

This grammer can parse statements like:

Big Horses Are 10
My life is 15
The Sky Is 20
Your code is 9

The problem is the keyword rule. If I reverse the order of the rules it references:

keyword = (common_prefix / is) !letter

the grammar stops working. I think is is because the 'a' sequence defined by common_prefix is a prefix of the 'are' sequence defined by is, and the !letter lookeahead is outside the () group, so the parser finds the token are, checks the !keyword rule, finds the leading 'a', matches the rule, finds that the next character is a letter ('r'), which causes the !letter lookahead to fail, and so doesn't recognise that are is a keyword.

This means adding rules to the keywords collection becomes incredibly error-prone; if any pattern in a rule is a prefix of a pattern defined in another rule, the grammar breaks in unpredictable ways.

I have been able to work around this behaviour by adding the !letter lookahead to individual keywords within a rule:

keyword = ( 'a' !letter / 'are' !letter )

but that would mean adding every keyword individually, which would also be extremely repetitive and error-prone.

Is there a way to specify that a negative lookahead should apply to every pattern with the alternation expression to avoid the prefix-matching behaviour I'm seeing?

Upvotes: 0

Views: 77

Answers (0)

Related Questions