StevieD

Reputation: 7433

Grammar not parsing as expected with negative lookaround assertion

OK, either this is a bug or I'm using a lookaround assertion completely wrong and am about to look like a total idiot. I don't care about the latter, so here we go.

Got this grammar I'm testing:

our grammar HC2 {
        token TOP { <line>+ }
        token line { [ <header> \n | <not-header> \n ] }
        token header { <header-start> <header-content> }
        token not-header { \N* }
        token header-start { <header-one> }
        token header-one { <[#]> <![#]> } # note this negative lookahead here
        token header-content { \N* }
}

I want to capture a markdown header with just one # sign, no more.

Here is the output from Grammar::Tracer/Debugger:

[Screenshot: Grammar::Tracer output with the lookahead in place; a text version appears under "As text" below]

So it's skipping right over the <header-start> capture. If I remove the <![#]> negative lookahead assertion, I get this:

[Screenshot: Grammar::Tracer output with the lookahead removed; a text version appears under "As text" below]

So is this a bug or am I out to lunch?

As text:

With the <![#]> lookahead:

TOP
|  line
|  |  not-header
|  |  * MATCH "# Grandmother's for a Brighter Future"
|  * MATCH "# Grandmother's for a Brighter Future\n"
|  line
|  |  not-header
|  |  * MATCH ""
|  * MATCH "\n"
|  line
|  |  not-header
|  |  * MATCH "# Development site"
|  * MATCH "# Development site\n"
|  line
|  |  not-header
|  |  * MATCH "* The new site is up and running at example.com"
|  * MATCH "* The new site is up and running at example.com\n"
|  line
|  |  not-header


With the lookahead removed:

TOP
|  line
|  |  header
|  |  |  header-start
|  |  |  |  header-one
|  |  |  |  * MATCH "#"
|  |  |  * MATCH "#"
|  |  |  header-content
|  |  |  * MATCH " Grandmother's for a Brighter Future"
|  |  * MATCH "# Grandmother's for a Brighter Future"
|  * MATCH "# Grandmother's for a Brighter Future\n"
|  line
|  |  not-header
|  |  * MATCH ""
|  * MATCH "\n"
|  line
|  |  header
|  |  |  header-start
|  |  |  |  header-one
|  |  |  |  * MATCH "#"
|  |  |  * MATCH "#"

UPDATE: If I modify header-one to:

token header-one { <[#]> <-[#]> }

it matches as expected. However, that does not answer the question as to why the original code does not match.

Upvotes: 5

Views: 87

Answers (1)

StevieD

Reputation: 7433

OK, so the non-technical answer is that I made a bad assumption that the | character behaves the same way as in Perl. It does not. In Perl, the regex engine attempts to match the pattern on the left-hand side of the | character. If that fails, it moves on to the pattern on the right-hand side.

To get the "old school" Perl behavior, use the || operator, called the "Alternation" operator: https://docs.raku.org/language/regexes#Alternation:_||

The | operator is called the "Longest Alternation" operator. See https://docs.raku.org/language/regexes#Longest_alternation:_|
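
A minimal sketch of the difference between the two operators. The strings and variable names here are made up for illustration and are not taken from my grammar:

```raku
# | is longest alternation: the alternative that matches the most text wins,
# regardless of the order in which the alternatives are written.
my $longest = "foobar" ~~ / foo | foobar /;
say ~$longest;   # foobar

# || is sequential alternation: alternatives are tried left to right,
# and the first one that matches wins (the Perl-style behavior).
my $first = "foobar" ~~ / foo || foobar /;
say ~$first;     # foo
```

In the grammar from the question, <not-header> can match an entire line, so under | it is able to win even though <header> is written first. The trace suggests the lookahead cuts short what <header> contributes to that longest-token comparison, but see the design document for the exact rules.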

A more detailed, much more technical discussion of how the "Longest Alternation" operator works is here: https://design.raku.org/S05.html#Longest-token_matching
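
Applied to the grammar from the question, the fix would look something like the sketch below. The only change to the rules is || in place of | in token line. The sample input and the $m variable are assumptions made up for illustration:

```raku
grammar HC2 {
    token TOP            { <line>+ }
    # || tries <header> first and only falls back to <not-header> if it fails
    token line           { [ <header> \n || <not-header> \n ] }
    token header         { <header-start> <header-content> }
    token not-header     { \N* }
    token header-start   { <header-one> }
    token header-one     { <[#]> <![#]> }   # exactly one #, not followed by another
    token header-content { \N* }
}

# Hypothetical input: a one-# header, a blank line, a two-# header, plain text.
my $m = HC2.parse("# Title\n\n## Sub\nplain text\n");

say $m<line>.elems;                 # 4
say $m<line>[0]<header>.defined;    # True:  "# Title" is captured as a header
say $m<line>[2]<header>.defined;    # False: "## Sub" falls through to not-header
say ~$m<line>[2]<not-header>;       # ## Sub
```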

Though I was already aware that || existed from my reading of the docs, I didn't read about it carefully. I mistakenly assumed the Raku core developers would make | behave like it did in Perl and that || was some cool new operator I could learn about later.

Big takeaway: try hard to uncover the basic assumptions you are making and don't assume anything until you've read the docs closely.

Upvotes: 5
