Can alternative rules overlap in their intial tokens in PEG?

Question

I think I just realized that a problem I've run into several times and for which I've always found some odd work-around might just be fundamental. I'm hoping someone can either validate that understanding or show me what I'm doing wrong.

I'm trying to create a parser that will handle several variants of a syntax. But I guess there is ambiguity because in the simplest case, each variant allows the same input. Since I was mapping them to the same syntax tree, I had hoped that it wouldn't matter. Of course it does.

I'm using Peg.js and trying to build a grammar that will accept, for simplicity's sake, inputs like abc/cd/e, x/yz, or just foo, but also accepts abc.cd.e, x.yz, and, of course still foo. I'm guessing the problem is that both of my variants accept foo.

This is what I was trying to do:

Expression = SlashTerm / DotTerm 
SlashTerm = Name ('/' SlashTerm)?
DotTerm = Name ('.' DotTerm)?
Name = [a-z0-9_\-]i+

And this recognizes foo/bar, but not foo.bar. If I switch to Expression = DotTerm / SlashTerm, of course it recognizes foo.bar and not foo/bar.

So the question is there any better way of handling this than manually forcing the choice with something like

Expression = DotTerm / SlashTerm 
SlashTerm = Name ('/' SlashTerm)?
DotTerm = Name '.' DotNode
DotNode = Name ('.' DotNode)?
Name = [a-z0-9_\-]i+

While there is no issue adding the additional rule, I don't like that this required me to switch from Expression = SlashTerm / DotTerm to Expression = DotTerm / SlashTerm. I thought either one should work since there is no ambiguity. Perhaps I misunderstand the choice operation, but my impression is that using Expression = SlashTerm / DotTerm, when it tried to parse foo.bar, it would get through foo, hit the ., not have a match and backtrack to the top of SlashTerm, which wouldn't match so that it would move to DotTerm, where it would find a match. Since that doesn't work, my understanding is incorrect, and I'm hoping someone can explain why.

I'd also love to hear about better ways to do this.

My real grammar is of course much more complex; SlashTerm is much more involved. But the DotTerm equivalent really is that simple. The DotTerm equivalent syntax (which I can obviously handle easily without a full-fledged parser) is the only supported syntax now, but I'm trying to expand it to a much more robust language. I'd like to not branch the code based on the choice of syntax, and want to easily be able to deprecate it later. Including a few rules in my grammar seems the simplest way to do do. But if there is another simple way to do this, I'd love to hear it.

rici · Accepted Answer

But SlashTerm does match. It doesn't match the entire input, but it successfully matches a part of the input. Consequently, SlashTerm / DotTerm successfully matches, too. And PEG doesn't backtrack over a successfully matched component. For PEG to backtrack, the component being matched must fail.

So one solution is to insist that the alternatives match the first delimiter, unless there isn't one:

Expression = SlashTerm / DotTerm / Name 
SlashTerm = Name ('/' Name)+
DotTerm = Name ('.' Name)+
Name = [a-z0-9_\-]i+

In that case, neither SlashTerm nor DotTerm will match a single word, so the choice operator must fall through to the third alternative. Similarly, foo.bar cannot match SlashTerm, so it will fail and the choice operator will fallback to the second alternative as expected.

In the above, I replaced the recursion with an iteration, which seemed to me to be simpler. If that doesn't fit your grammar model, you can adapt your original solution with one extra non-terminal:

Expression = SlashTerm / DotTerm / Name
SlashTerm = Name '/' SlashRest
DotTerm = Name '.' DotRest
SlashRest = Name ('/' SlashRest)?
DotRest = Name ('.' DotRest)?
Name = [a-z0-9_\-]i+

Can alternative rules overlap in their intial tokens in PEG?

Answers (1)

Related Questions