Reputation: 61
I want to write a Raku grammar to check a monthly report about work contribution factors, written in Racket/Scribble.
The report is divided, at the highest level, into monthly sections, and beneath that, into subsections. Within the contribution factors subsection, a repeating set of subsubsections, one per contribution factor, describes what I did for that factor during that month. I've included pared-down Racket code here.
The contribution factors are named "Quaffing" and "Quenching" as stand-ins for real contribution factors. Although I haven't included them here, there are further subsections (and subsubsections). Within each month, I include a standard set of subsections and subsubsections. As the first part of each subsection and subsubsection name, I include a standard name. As the second part of each subsection and subsubsection name, I tack on the year and month, written out like "2024 12(December)". This, of course, changes each month, and keeps the sections, subsections, and subsubsections distinct across the whole document.
I want to use a Raku grammar to parse the Racket/Scribble code to ensure it's consistently formatted. I want to ensure that all the sections, subsections, and subsubsections are in place, and fit the pattern of standard subsection and subsubsection name, followed by the current year/month.
For each month's section, I need the year/month to change, and I want the grammar to do so automatically.
Here is the Racket/Scribble code:
#lang scribble/manual
@title["\
Contribution Monthly Report\
" #:version "0.001"]
@table-of-contents{}
@section[#:tag "\
Report of 2024 12(December) 31\
"]{Report of 2024 12(December) 31}
@subsection{Contribution Factors Progress, 2024 12(December)}
@subsubsection[#:tag "\
Factor 1: Quaffing, 2024 12(December)\
"]{Factor 1: @italic{Quaffing}, 2024 12(December)}
Random lines of text.
@subsubsection[#:tag "\
Factor 2: Quenching, 2024 12(December)\
"]{Factor 2: @italic{Quenching}, 2024 12(December)}
Random lines of text.
@section[#:tag "\
Report of 2024 11(November) 30\
"]{Report of 2024 11(November) 30}
@subsection{Contribution Factors Progress, 2024 11(November)}
@subsubsection[#:tag "\
Factor 1: Quaffing, 2024 11(November)\
"]{Factor 1: @italic{Quaffing}, 2024 11(November)}
Lines of Text
@subsubsection[#:tag "\
Factor 2: Quenching, 2024 11(November)\
"]{Factor 2: @italic{Quenching}, 2024 11(November)}
Lines of text.
@index-section{}
For context and reference, here is the whole Raku grammar I'm using to parse the above Racket/Scribble code. This code sample shows that I set a dynamic variable, $*tsymm [this section's year month Month], to hold the changing year/month string that will be appended to the subsection and subsubsection name patterns. I've left in small debugging snippets.
Further on in this question, I've also placed just the token where I have the problem:
use v6;
#use Test;
use Grammar::Tracer;
# Hardcoded file name
my $file-name = 'short_obfu_Monthly_Notes.rkt';
# Slurp the file content
my $file-content = try $file-name.IO.slurp;
if $! {
die "Error reading file '$file-name': $!";
}
grammar MonthlyReport {
#my $*tsymm;
token TOP {
:my $*tsymm;
^
<lang-statement>
<title>
<table-of-contents>
<monthly-cycle>+ # this token contains a refererence to the token with the problem.
<index>
$
}
token lang-statement {
^^'#' lang \s+ scribble '/' manual \n
}
token title {
\n
'@title["\\' \n
'Contribution Monthly Report\\' \n
'" #:version "0.001"]' \n
#{say '「' ~ $¢ ~ '」';}
}
token table-of-contents {
\n
'@table-of-contents{}' \s*? \n
}
token monthly-cycle {
{say $*tsymm}
<section-wrt-month>
<contribution-factors-progress>
}
token section-wrt-month { # This token is the problem.
\n
'@section[#:tag "' \\ \n
'Report of ' $<this-sections-yyyy-mm-Month>=(\d\d\d\d \s \d\d \([January|February|March|April|May|June|July|August|September|October|November|December]\)) \s [29|30|31] \\ \n
'"]{Report of ' $<this-sections-yyyy-mm-Month> \s \d\d \} \n
{say " trying 1 trying:\n\n $/ \n\n";}
{say " trying 2 trying:\n\n $/{'this-sections-yyyy-mm-Month'} \n\n";}
{$*tsymm = $<this-sections-yyyy-mm-Month>;}
#{$*tsymm.say;}
#{say '「' ~ $¢ ~ '」';}
}
token contribution-factors-progress {
\n
'@subsection{Contribution Factors Progress, ' $*tsymm \} \n
<factor1>
<factor2>
#{say '「' ~ $¢ ~ '」';}
}
token factor1 {
\n
'@subsubsection[#:tag "\\' \n
'Factor 1: Quaffing, ' $*tsymm \\ \n
'"]{Factor 1: @italic{Quaffing}, ' $*tsymm \} \n
.*? <?before \@subsubsection>
#{say 'factor 1 ->「' ~ $¢ ~ '」<- factor 1';}
}
token factor2 {
'@subsubsection[#:tag "\\' \n
'Factor 2: Quenching, ' $*tsymm \\ \n
'"]{Factor 2: @italic{Quenching}, ' $*tsymm \} \n
.*? <?before \@subsubsection>
#{say 'factor 2 ->「' ~ $¢ ~ '」<- factor 2';}
}
token index {
\n
'@index-section{}'
#{say 'index ->「' ~ $¢ ~ '」<- index';}
}
}
# Check the format of the file content
if MonthlyReport.parse($file-content) {
say "The file format is valid.";
} else {
say "The file format is invalid.";
}
The token that's not doing what I want is section-wrt[with regard to]-month. This is the same code as above, just excerpted here to allow focus.
token section-wrt-month {
\n
'@section[#:tag "' \\ \n
'Report of ' $<this-sections-yyyy-mm-Month>=(\d\d\d\d \s \d\d \([January|February|March|April|May|June|July|August|September|October|November|December]\)) \s [29|30|31] \\ \n
'"]{Report of ' $<this-sections-yyyy-mm-Month> \s \d\d \} \n
{say " trying 1 trying:\n\n $/ \n\n";}
{say " trying 2 trying:\n\n $/{'this-sections-yyyy-mm-Month'} \n\n";}
{$*tsymm = $<this-sections-yyyy-mm-Month>;}
#{$*tsymm.say;}
#{say '「' ~ $¢ ~ '」';}
}
I expected the named capture, $<this-sections-yyyy-mm-Month>=(\d\d\d\d \s \d\d \([January|February|March|April|May|June|July|August|September|October|November|December]\)), to be set when it finds the month section (this does work). I want it to be reset on the second pass through section-wrt-month; I expected it to change to 2024 11(November), but it does not.
I tried changing the token to a rule and to a regex, but neither of those helped.
I tried setting $*tsymm to $0, but that does not work.
I consulted ChatGPT, o1, but it lectured me, incorrectly, about details of the alternations within the regex. When I tried what it (so confidently) lectured me on, it was not true, and was not related to the main problem.
I tried searching this out in the Raku on-line documentation as well as in several Raku/Perl6 books I have. They don't get into enough detail to help with this.
The output contains the ANSI coloring and shows the failure:
[Image: Output]
Upvotes: 5
Views: 128
Reputation: 2341
Not a full answer, but an approach using Raku's ISO-8601 capabilities:
~$ raku -e 'my %months = [Z=>] <Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec>, 1..12; %months.sort(*.value).say;'
(Jan => 1 Feb => 2 Mar => 3 Apr => 4 May => 5 Jun => 6 Jul => 7 Aug => 8 Sep => 9 Oct => 10 Nov => 11 Dec => 12)
Since the OP has noted the brittleness of the original code posted, a more robust solution is to decipher input dates into ISO-8601 format, and then output using various Raku formatting capabilities.
~$ echo 'Feb/01/2025' | raku -pe 'BEGIN my %months = [Z=>] <Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec>, 1..12; \
s{ ^ ( <.alpha>**3 ) \/ (\d**2) \/ (\d**4) } = "$2-{sprintf q[%02d], %months{$0.tclc}}-$1".Date;'
2025-02-01
~$ echo 'Feb/01/2025' | raku -pe 'BEGIN my %months = [Z=>] <Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec>, 1..12; \
s{ ^ ( <.alpha>**3 ) \/ (\d**2) \/ (\d**4) } = "$2-{sprintf q[%02d], %months{$0.tclc}}-$1".Date.^name;'
Date
If you use ISO-8601 dates with Raku's .Date method, you can actually check the validity of the date, such as ensuring you're rejecting dates referring to the 32nd day of the month.
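As a minimal sketch of that validity check, applied to the date format used in the question's section titles: the helper name report-date is hypothetical, and it uses Date.new with three integers rather than the .Date string method shown above. It only checks that the numeric year, month, and day form a real calendar date; it does not check that the month name agrees with the month number.

```raku
# Hypothetical helper: validates a report date such as "2024 12(December) 31".
sub report-date(Str $text --> Date) {
    # Four-digit year, two-digit month, month name in parentheses, two-digit day.
    return Date unless $text ~~ / ^ (\d ** 4) \s (\d ** 2) '(' \w+ ')' \s (\d ** 2) $ /;
    # Date.new fails for impossible dates; `try` turns that into an undefined value.
    my $date = try Date.new($0.Int, $1.Int, $2.Int);
    return $date // Date;
}

for '2024 12(December) 31', '2024 11(November) 31' -> $text {
    my $date = report-date($text);
    say $date.defined ?? "$text is valid: $date" !! "$text is rejected";
}
```

The second input is rejected because November has only 30 days, which is the kind of error a bare [29|30|31] alternation cannot catch.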
A common problem with deciphering dates is taking input that has a variable number of digits, like 1 February rather than 01 February. The short answer is to write the regex \d**1..2 instead, then use sprintf to pad. See details here:
https://unix.stackexchange.com/a/769411
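A small sketch of that padding idea, assuming input of the form "day, whitespace, month name" (the helper name pad-day is made up for illustration):

```raku
# Hypothetical helper: zero-pads a one- or two-digit day,
# so "1 February" and "01 February" come out the same.
sub pad-day(Str $text --> Str) {
    $text ~~ / ^ (\d ** 1..2) \s+ (\w+) $ /
        ?? sprintf('%02d %s', $0.Int, ~$1)   # pad the day to two digits
        !! $text;                            # leave unrecognised input untouched
}

say pad-day('1 February');
say pad-day('01 February');
```

Both calls should print the same padded form, which can then be fed into a stricter pattern or a Date constructor.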
I'll conclude by saying that I don't understand the \s [29|30|31] that matches the day in your Racket code. Regardless, I wonder whether the corresponding Raku section should be in brackets, followed by the optional quantifier, e.g. [ \s [29|30|31] ]?
Further ISO-8601 ideas using Raku, below:
https://unix.stackexchange.com/search?q=user%3A227738++ISO+8601
https://docs.raku.org/type/Date
Upvotes: 0
Reputation: 7581
This is not an answer in the specific sense, but a general response to your question that I hope will be helpful.
[As ever, I defer to @raiph's attention to detail and actual code fix]
I am very impressed with the progress that you have made and would encourage you to keep going ... I have built several real-world Raku grammars and they are always quite intricate, since that is the nature of parsing/regexing at a character level. I am sure you know to take ChatGPT with a pinch of salt.
At first, I wanted to say "don't use Raku grammars to solve this problem; the quickest way to extract data from your source file is more likely to be a set of regexes". Why? Well, the source file is quite odd: there is a predilection for newlines and repeated info. A regex-type approach would try to pick out anchors (e.g. section, subsection, subsubsection) and then key off these to capture the variable data. In contrast, a grammar like yours is trying to pick up all the text, and is more work and more prone to small errors.
Then I saw you wrote that you want to check the correctness/completeness of the source. [This goal seems a bit nutty to me, but I am sure you have your reasons]
In this case, I think you have made a good (comprehensive) start, but your grammar is brittle: would you really care if version 0.001 became version 0.002?
So, my current view, based on how I would do this myself, is that your grammar's token structure needs to have a good impedance fit with the language that you are parsing. This is another way of saying: take a top-down look and try to extract the patterns that you want in a hierarchical way.
What do I mean by that, what would I change...
Many of the features are 3-line stanzas, so I would try to make a general token to match these stanzas
Many of these have repeated text, so I would try to check and then eliminate the duplications
They have a consistent syntax built from components, so have tokens for each component
Something like:
... what you have already around TOP ...
token stanza { <header> <tagged> <untagged> }
token header { '@' ['section'|'subsection'|<subsub>] }
token tagged { '[#:tag "\\' <factor> <subject> <yyyymm> ']' } # look up ~ and % in the docs
token untagged { '{' <factor> <subject> <yyyymm> '}' }
token factor { Factor \d+ ':' <.ws> }
token subject { [\@italic\{]? [Quaffing | Quenching] [}]? ',' <.ws> }
token yyyymm { like you have it }
This is just a rough idea ... but hopefully you get the feeling for the level of granularity / reusability of tokens.
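To make the rough idea above concrete, here is a hedged, runnable sketch for the factor stanzas only. The grammar name FactorStanza, the exact token bodies, and the sample text are my assumptions for illustration; section and subsection headers would need their own tokens, and the newline and backslash handling is fitted to the sample in the question.

```raku
grammar FactorStanza {
    token TOP      { <stanza>+ }
    token stanza   { <header> <tagged> <untagged> \n }
    token header   { '@subsubsection' }
    # Tag part: opening bracket, backslash-newline, the title, backslash-newline, close.
    token tagged   { '[#:tag "\\' \n <factor> <subject> <yyyymm> '\\' \n '"]' }
    token untagged { '{' <factor> <subject> <yyyymm> '}' }
    token factor   { 'Factor ' \d+ ': ' }
    token subject  { [ '@italic{' ]? [ Quaffing | Quenching ] '}'? ', ' }
    token yyyymm   { \d ** 4 ' ' \d ** 2 '(' <.alpha>+ ')' }
}

# Q:to is a heredoc with no escapes or interpolation, so '\' and '@' stay literal.
my $text = Q:to/END/;
@subsubsection[#:tag "\
Factor 1: Quaffing, 2024 12(December)\
"]{Factor 1: @italic{Quaffing}, 2024 12(December)}
END

my $match = FactorStanza.parse($text);
for $match<stanza> -> $s {
    # Check the duplicated year/month in the tag against the one in the title.
    my $agree = $s<tagged><yyyymm> eq $s<untagged><yyyymm>;
    say $s<tagged><factor>.Str.trim, " tag and title months agree: $agree";
}
```

Because each stanza exposes its parts as named captures, the duplication check becomes a plain string comparison after parsing instead of something the grammar itself has to enforce.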
Upvotes: 2
Reputation: 32489
TL;DR This initial answer provides terse summaries of:
A way to make your grammar work.
What your code is doing wrong.
Why I think you got confused.
I intend to write one or more other answers, and/or later edit this one. The point would be to go into greater depth for the above three topics plus some others. But I wanted to give you something tonight, partly because my plan may not pan out, and partly to provide something in the meantime even if I do end up writing more.
Insert a \n at the start of factor2. This is consistent with all the other tokens you'd written. It's a tidy up coordinated with the second change:
Add a token end-of-section { $ | <before \n \@ <[a..zA..Z-]>*? 'section'> } to the grammar and replace the <?before \@subsubsection> patterns in the two factor tokens with <end-of-section>.
I'm not saying those are necessarily the changes you really want for your full grammar. I am saying they work for the code you've shared in your question, and will hopefully be illuminating and perhaps a step forward to an appropriate solution.
The regex .*? <?before \@subsubsection> matches all text from the current parse position forward to just before the next instance of the text @subsubsection.
The first use of this pattern in your factor1 code works as you want. That's because the @subsubsection that the <?before \@subsubsection> matches is the one immediately following the random text you wrote that is still within December.
But the first use of this pattern in your factor2 code does not work as you want:
It starts to do parsing at the point immediately following where the factor1 token finished off matching. That is to say it starts at the (blank line before the) @subsubsection[#:tag "\ Factor 2: that's still in December. This is still what you want.
It then keeps matching until it reaches the next @subsubsection. But the next one is in the November data!
The upshot is that the first time through section-wrt-month does "successfully" match, but it achieves that "success" erroneously -- it gobbles up the input part way into November's data as it matches!
Thus the second call of section-wrt-month begins its matching at the first (blank line before the) @subsubsection[#:tag "\ Factor 1: Quaffing, 2024 11(November)\. This is of course the wrong place to be parsing. So it fails to match. And then the index token also begins at the same place, which is the wrong place for it too, so it also fails to match, and then the whole parse fails.
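The overshoot described above can be reproduced in isolation. This is a reduced sketch, not your grammar: the sample text is a simplified stand-in, and the end-of-section token is the one suggested earlier in this answer, with the hyphen in the character class written in escaped form.

```raku
# Simplified stand-in for the end of December and the start of November.
my $doc = Q:to/END/;
@subsubsection{Factor 2, December}
December text.

@section{November}

@subsubsection{Factor 1, November}
November text.
END

# Pattern from the question: frugal match up to the NEXT '@subsubsection',
# which here belongs to November.
my $overshoot = $doc ~~ / ^ '@subsubsection{Factor 2, December}' \n
                          .*? <?before '@subsubsection'> /;
say 'lookahead on @subsubsection swallows the November header: ',
    $overshoot.Str.contains('@section{November}');

# Suggested alternative: stop at end of input, or before a newline followed
# by any '@...section' command.
my token end-of-section { $ | <before \n \@ <[a..zA..Z\-]>*? 'section'> }
my $bounded = $doc ~~ / ^ '@subsubsection{Factor 2, December}' \n
                        .*? <end-of-section> /;
say 'end-of-section swallows the November header: ',
    $bounded.Str.contains('@section{November}');
```

Printing the whole captured string like this, rather than relying on a truncated trace, makes it obvious where a token actually stopped.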
I imagine there are likely many factors leading to your confusion including:
Weaknesses of the debugging tools you're using? Grammar::Tracer was a wonderful new tool when it was first introduced (in 2011). But there are other options, and it looks like this old tool led you astray. (I imagine you were tricked by the green lights on the first two factor token matches. The match capture string it displays is truncated, so you can't see that while factor1 captured as you wanted, factor2 captured too much.)
Lack of familiarity with Raku? It looks like you know a lot. Dynamic variables!?! $¢!?! But it's hard to know if that's just ChatGPT throwing random guesses at you.
Lack of familiarity with regexing and/or being thrown by thinking the problem was something to do with use of Raku? Again, it's plausible you know regexing well, or ChatGPT thinks it does, but a fundamental problem here was not realizing what .*? was doing. The regex atom .*? is not Raku specific but is instead found in pretty much all regex languages. Similarly, syntax aside, <?before foo> is just a lookahead predicate which has the same semantics in Raku as it does in the many (most) other regex languages/libraries/engines which have the same feature.
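On the dynamic variable point: the mechanism you chose is sound in itself. Here is a minimal sketch with made-up names (Months, month, $*ym) and a much simpler input format than your report, showing a per-month capture being stored in a dynamic variable and reused as a literal later in the same cycle.

```raku
grammar Months {
    token TOP   { :my $*ym; <month>+ }
    token month {
        'Report of ' $<ym>=[ \d ** 4 ' ' \d ** 2 ] \n
        { $*ym = ~$<ym> }          # remember this cycle's year/month
        'Progress, ' $*ym \n       # must repeat the same year/month
    }
}

my $consistent   = "Report of 2024 12\nProgress, 2024 12\n"
                 ~ "Report of 2024 11\nProgress, 2024 11\n";
my $inconsistent = "Report of 2024 12\nProgress, 2024 12\n"
                 ~ "Report of 2024 11\nProgress, 2024 12\n";

say so Months.parse($consistent);
say so Months.parse($inconsistent);
```

The first parse should succeed and the second should fail, because the variable is reassigned on every pass through the month token. That is why the fix belongs in where the factor tokens stop, not in the capture.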
As I said at the start, I hope to later provide guidance so that you have much more fun and/or are much more productive than I imagine you managed with this work/exercise so far. Or perhaps others will pitch in with comments or answers.
Upvotes: 3