Reputation: 7433
This code parses $string as I'd like:
#! /usr/bin/env raku
my $string = q:to/END/;
aaa bbb # this has trailing spaces which I want to keep
kjkjsdf
kjkdsf
END
grammar Markdown {
token TOP { ^ ([ <blank> | <text> ])+ $ }
token blank { [ \h* <.newline> ] }
token text { <indent> <content> }
token indent { \h* }
token newline { \n }
token content { \N*? <trailing>* <.newline> }
token trailing { \h+ }
}
my $match = Markdown.parse($string);
$match.say;
OUTPUT
「aaa bbb
kjkjsdf
kjkdsf
」
0 => 「aaa bbb
」
text => 「aaa bbb
」
indent => 「」
content => 「aaa bbb
」
trailing => 「 」
0 => 「
」
blank => 「
」
0 => 「 kjkjsdf
」
text => 「 kjkjsdf
」
indent => 「 」
content => 「kjkjsdf
」
0 => 「kjkdsf
」
text => 「kjkdsf
」
indent => 「」
content => 「kjkdsf
」
Now, the only problem I'm having is that I'd like the <trailing> capture to be at the same level of the hierarchy as the <indent> and <content> captures.
So I tried this grammar:
grammar Markdown {
token TOP { ^ ([ <blank> | <text> ])+ $ }
token blank { [ \h* <.newline> ] }
token text { <indent> <content> <trailing>* <.newline> }
token indent { \h* }
token newline { \n }
token content { \N*? }
token trailing { \h+ }
}
However, it breaks the parsing. So I tried this:
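A likely reason for the failure, as far as I understand token semantics: a token does not go back into a subrule once that subrule has returned, so the frugal \N*? inside <content> settles for the empty match and <.newline> then fails on the first character of the line. In the first grammar the frugal quantifier sat in the same token as <.newline>, so it could keep extending. The sketch below isolates that difference; the grammar names SameToken and SubruleCall and the input "abc\n" are made up for illustration.

```raku
# Frugal quantifier and newline in the same token: \N*? can grow
# one character at a time until \n matches.
grammar SameToken {
    token TOP { $<content>=\N*? \n }
}

# Frugal quantifier hidden in a subrule: <content> matches the empty
# string first, and the calling token does not retry it.
grammar SubruleCall {
    token TOP     { <content> \n }
    token content { \N*? }
}

say so SameToken.parse("abc\n");     # expected: True
say so SubruleCall.parse("abc\n");   # expected: False
```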
token TOP { ^ ([ <blank> | <text> ])+ $ }
token blank { [ \h* <.newline> ] }
token text { <indent> <content>*? <trailing>* <.newline> }
token indent { \h* }
token newline { \n }
token content { \N }
token trailing { \h+ }
And got:
0 => 「aaa bbb
」
text => 「aaa bbb
」
indent => 「」
content => 「a」
content => 「a」
content => 「a」
content => 「 」
content => 「b」
content => 「b」
content => 「b」
trailing => 「 」
0 => 「
」
blank => 「
」
0 => 「 kjkjsdf
」
text => 「 kjkjsdf
」
indent => 「 」
content => 「k」
content => 「j」
content => 「k」
content => 「j」
content => 「s」
content => 「d」
content => 「f」
0 => 「kjkdsf
」
text => 「kjkdsf
」
indent => 「」
content => 「k」
content => 「j」
content => 「k」
content => 「d」
content => 「s」
content => 「f」
This is pretty close to what I want, but it has the undesirable effect of breaking <content> up into individual characters. I could fix this pretty easily after the fact by massaging the $match object, but I would like to try to improve my skills with grammars.
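For reference, the after-the-fact fix could look something like the sketch below, which joins the per-character <content> captures back into one string per line. The input string is an assumption: its whitespace (two trailing spaces, one blank line, two leading spaces) is reconstructed from the output above, and @content is just an illustrative variable name.

```raku
# Third attempt from above, where <content> matches a single character.
grammar Markdown {
    token TOP      { ^ ([ <blank> | <text> ])+ $ }
    token blank    { [ \h* <.newline> ] }
    token text     { <indent> <content>*? <trailing>* <.newline> }
    token indent   { \h* }
    token newline  { \n }
    token content  { \N }
    token trailing { \h+ }
}

# Assumed input; the exact whitespace is a guess based on the output shown.
my $string = "aaa bbb  \n\n  kjkjsdf\nkjkdsf\n";
my $match  = Markdown.parse($string);

my @content;
for $match[0].list -> $line {
    next unless $line<text>;                    # skip lines matched as <blank>
    @content.push: $line<text><content>.join;   # glue the characters together
}
.raku.say for @content;
```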
Upvotes: 5
Views: 150
Reputation: 4465
quick and dirty
my $string = q:to/END/;
aaa bbb
kjkjsdf
kjkdsf
END
grammar Markdown {
token TOP { ^ ([ <blank> | <text> ])+ $ }
token blank { [ \h* <.newline> ] }
token text { <indent>? $<content>=\N*? <trailing>? <.newline> }
token indent { \h+ }
token newline { \n }
token trailing { \h+ }
}
my $match = Markdown.parse($string);
$match.say;
lookahead assertions
my $string = q:to/END/;
aaa bbb
kjkjsdf
kjkdsf
END
grammar Markdown {
token TOP { ^ ([ <blank> | <text> ])+ $ }
token blank { [ \h* <.newline> ] }
token text { <indent>? <content> <trailing>? <.newline> }
token indent { \h+ }
token newline { \n }
token content { [<!before <trailing>> \N]+ }
token trailing { \h+ $$ }
}
my $match = Markdown.parse($string);
$match.say;
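With this version <indent>, <content> and <trailing> sit side by side under <text>, so they can be read off each line directly. A minimal sketch of that, assuming an input whose whitespace is reconstructed from the question's output (the string literal and the @rows variable are illustrative, not part of the original post):

```raku
grammar Markdown {
    token TOP      { ^ ([ <blank> | <text> ])+ $ }
    token blank    { [ \h* <.newline> ] }
    token text     { <indent>? <content> <trailing>? <.newline> }
    token indent   { \h+ }
    token newline  { \n }
    token content  { [<!before <trailing>> \N]+ }
    token trailing { \h+ $$ }
}

# Assumed input: two trailing spaces, a blank line, two leading spaces.
my $string = "aaa bbb  \n\n  kjkjsdf\nkjkdsf\n";
my $match  = Markdown.parse($string);

my @rows;
for $match[0].list -> $line {
    with $line<text> -> $text {
        # <indent>? and <trailing>? may not have matched, so default to ''.
        @rows.push: %(
            indent   => ($text<indent>   // '').Str,
            content  => $text<content>.Str,
            trailing => ($text<trailing> // '').Str,
        );
    }
}
.raku.say for @rows;
```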
a little refactoring
my $string = q:to/END/;
aaa bbb
kjkjsdf
kjkdsf
END
grammar Markdown {
token TOP { ( <blank> | <text> )+ %% \n }
token blank { ^^ \h* $$ }
token text { <indent>? <content> <trailing>? }
token indent { ^^ \h+ }
token content { [<!before <trailing>> \N]+ }
token trailing { \h+ $$ }
}
my $match = Markdown.parse($string);
$match.say;
Upvotes: 6
Reputation: 7433
I was able to accomplish what I wanted with a negative lookahead assertion:
token TOP { ^ ([ <blank> | <text> ])+ $ }
token blank { [ \h* <.newline> ] }
token text { <indent>? <content> <trailing>? <.newline> }
token indent { \h+ }
token newline { \n }
token content { <.non_trailing> }
token non_trailing { ( . <!before \w \h* \n>)+ \S* }
token trailing { \h+ }
The <.non_trailing> call is non-capturing, which keeps the individual characters from appearing in the match object. The ( . <!before \w \h* \n>)+ part matches each character that is not followed by a word character, optional horizontal whitespace, and a newline, and the \S* part picks up the characters left over from the negative lookahead.
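To see the lookahead in isolation, here is a small sketch that runs the same pattern as a lexical token against three sample lines. The sample strings (with two trailing spaces on the first) and the @lines and @content variables are assumptions for illustration; indentation is left out so that only the trailing-whitespace handling is exercised.

```raku
# Same pattern as in the grammar above; `my token` keeps the
# no-backtracking behaviour it has inside a grammar.
my token non_trailing { ( . <!before \w \h* \n> )+ \S* }

# Assumed sample lines, without leading indentation.
my @lines = "aaa bbb  \n", "kjkjsdf\n", "kjkdsf\n";

my @content;
for @lines -> $line {
    if $line ~~ / ^ <non_trailing> / {
        @content.push: ~$<non_trailing>;   # text up to the trailing whitespace
    }
}
.raku.say for @content;
```

Since . in Raku also matches a newline, it may be worth testing this pattern on lines that end in punctuation or in a very short word before relying on it.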
OUTPUT
「aaa bbb
kjkjsdf
kjkdsf
」
0 => 「aaa bbb
」
text => 「aaa bbb
」
content => 「aaa bbb」
trailing => 「 」
0 => 「
」
blank => 「
」
0 => 「 kjkjsdf
」
text => 「 kjkjsdf
」
indent => 「 」
content => 「kjkjsdf」
0 => 「kjkdsf
」
text => 「kjkdsf
」
content => 「kjkdsf」
Upvotes: 3