Reputation: 7433
OK, this is either a bug or I'm going to look like a total idiot and I'm using a lookaround assertion completely wrong. I don't care about the latter so here we go.
Got this grammar I'm testing:
our grammar HC2 {
token TOP { <line>+ }
token line { [ <header> \n | <not-header> \n ] }
token header { <header-start> <header-content> }
token not-header { \N* }
token header-start { <header-one> }
token header-one { <[#]> <![#]> } # note this negative lookahead here
token header-content { \N* }
}
I want to capture a markdown header with just one #
sign, no more.
Here is the output from Grammar::Tracer/Debugger:
So it's skipping right over the <header-start>
capture. If I remove the <![#]>
negative lookahead assertion, I get this:
So is this a bug or am I out to lunch?
As text:
TOP
>
| line
>
| | not-header
>
| | * MATCH "# Grandmother's for a Brighter Future"
>
| * MATCH "# Grandmother's for a Brighter Future\n"
>
| line
>
| | not-header
>
| | * MATCH ""
>
| * MATCH "\n"
>
| line
>
| | not-header
>
| | * MATCH "# Development site"
>
| * MATCH "# Development site\n"
>
| line
>
| | not-header
>
| | * MATCH "* The new site is up and running at example.com"
>
| * MATCH "* The new site is up and running at example.com\n"
>
| line
>
| | not-header
>
TOP
>
| line
>
| | header
>
| | | header-start
>
| | | | header-one
>
| | | | * MATCH "#"
>
| | | * MATCH "#"
>
| | | header-content
>
| | | * MATCH " Grandmother's for a Brighter Future"
>
| | * MATCH "# Grandmother's for a Brighter Future"
>
| * MATCH "# Grandmother's for a Brighter Future\n"
>
| line
>
| | not-header
>
| | * MATCH ""
>
| * MATCH "\n"
>
| line
>
| | header
>
| | | header-start
>
| | | | header-one
>
| | | | * MATCH "#"
>
| | | * MATCH "#"
UPDATE: If I modify header-start
to:
token header-one { <[#]> <-[#]> }
it matches as expected. However, that does not answer the question as to why the original code does not match.
Upvotes: 5
Views: 87
Reputation: 7433
OK, so the non-technical answer is I made a bad assumption that the |
character behaves the same was as in Perl. It does not. In Perl, the regex engine attempts to match the pattern on the left hand side of the |
character. If that fails, it moves on to the pattern in the right hand side.
To get the "old school" Perl behavior, use the ||
operator, called the "Alternation" operator: https://docs.raku.org/language/regexes#Alternation:_||
The |
operator is called the "Longest Alternation" operator. See https://docs.raku.org/language/regexes#Longest_alternation:_|
A more detailed, much more technical discussion of how the "Longest Alternation" operator works is here: https://design.raku.org/S05.html#Longest-token_matching
Though I was already aware the ||
existed from my reading of the docs, I didn't read about it carefully. I mistakenly assumed Raku core developer would make |
behave like it did in Perl and that ||
was some cool new operator I could learn about later.
Big takeaway: try hard to uncover the basic assumptions you are making and don't assume anything until you've read the docs closely.
Upvotes: 5