Reputation: 1033
Suppose grammar G has the following productions …
Using grammar G, a derivation of the string aaaaabb is …
S ⇒ aSb ⇒ aaSbb ⇒ aaAbb ⇒ aaaaabb
In the Raku grammar shown below, why is parsing of the string aaaaabb not successful? Is the derivation shown above not applicable?
# Goal: language generated by grammar = { a^n b^m : n <= m + 3 }
# Example string is:
# aaaaabb = a^5 b^2, where 5 <= 2 + 3
# Derivation of string aaaaabb is:
# S => aSb => aaSbb => aaAbb => aaaaabb
grammar MyContextFreeGrammar
{
    token TOP { <S> }
    token S { 'a' <S> 'b' | <A> | <B> }
    token A { '' | 'a' | 'a' 'a' | 'a' 'a' 'a' }
    token B { 'b' <B> | 'b' }
} # end grammar
my $abString = 'aaaaabb';
my $myParseResult = MyContextFreeGrammar.parse( $abString ) ;
if $myParseResult.Bool
{
    printf( "\nParsing of \'%s\' was *SUCCESSFUL*.\n", $abString ) ;
}
else
{
    printf( "\nParsing of \'%s\' was *NOT* successful.\n", $abString ) ;
}
Program output is …
Parsing of 'aaaaabb' was *NOT* successful.
The Raku documentation (https://docs.raku.org/language/grammars) indicates that the declarators token and rule are both ratcheting, which means that the match engine won't back up and try again if it fails to match something.
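A minimal sketch of that ratcheting behavior, separate from the program above (the grammar names Ratchet and Backtrack are illustrative only):

grammar Ratchet   { token TOP { 'a'+ 'ab' } }   # token implies :ratchet (no backtracking)
grammar Backtrack { regex TOP { 'a'+ 'ab' } }   # regex allows backtracking

# In Ratchet, 'a'+ greedily consumes every 'a' and never gives one back,
# so the trailing 'ab' can never match and the parse fails.
say so Ratchet.parse( 'aaab' ) ;     # False
# In Backtrack, the engine backs up one 'a' so that 'ab' can match.
say so Backtrack.parse( 'aaab' ) ;   # True

The same effect defeats the alternative 'a' <S> 'b' in the grammar above: once the inner <S> has matched, a later failure to find enough closing 'b's cannot force the ratcheting engine back to try a shorter or different match for <S>.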
The revised grammar shown below, which uses regex rather than token, allows backtracking and behaves as desired for the two test strings aaaaabb (which is in the language generated by grammar G) and aaaaaabb (which is not in the language generated by grammar G).
grammar MyContextFreeGrammar
{
    regex TOP { <S> }
    regex S { 'a' <S> 'b' | <A> | <B> }
    regex A { '' | 'a' | 'a' 'a' | 'a' 'a' 'a' }
    regex B { 'b' <B> | 'b' }
} # end grammar
my $abString = 'aaaaabb';
my $myParseResult = MyContextFreeGrammar.parse( $abString ) ;
printf( "\nParsing of \'%s\' was "
~ ($myParseResult ?? "*SUCCESSFUL*.\n" !! "*NOT* successful.\n" ),
$abString ) ;
$abString = 'aaaaaabb';
$myParseResult = MyContextFreeGrammar.parse( $abString ) ;
printf( "\nParsing of \'%s\' was "
~ ($myParseResult ?? "*SUCCESSFUL*.\n" !! "*NOT* successful.\n" ),
$abString ) ;
Program output is …
Parsing of 'aaaaabb' was *SUCCESSFUL*.
Parsing of 'aaaaaabb' was *NOT* successful.
Upvotes: 7
Views: 176
Reputation: 32489
A natural idiomatic solution to the problem example:
grammar S { token TOP { (a*) (b*) <?{ $0.chars <= $1.chars + 3 }> } }
This directly encodes your example's description in a parser -- the token¹ does a match/capture of zero or more a's, then a match/capture of zero or more b's, and finally a predicate check corresponding to your example's formula.
Test code and results:
say "$_: {so S.parse: $_}" for ^9 .map: { 'a' x $_ ~ 'bb' }
bb: True
abb: True
aabb: True
aaabb: True
aaaabb: True
aaaaabb: True
aaaaaabb: False
aaaaaaabb: False
aaaaaaaabb: False
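An equivalent formulation (my own variation, not part of the answer above) uses named captures instead of the positional $0/$1, which some find more readable:

grammar S2 {
    token TOP { $<as>=[ 'a'* ] $<bs>=[ 'b'* ] <?{ $<as>.chars <= $<bs>.chars + 3 }> }
}
say so S2.parse( 'aaaaabb' ) ;    # True  (5 <= 2 + 3)
say so S2.parse( 'aaaaaabb' ) ;   # False (6 >  2 + 3)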
One can use generative grammars with Raku just as much as one can use them with any other PL.
But the vision and mission for Raku includes first class grammars (and grammar fragments) that can be composed.
If you combine two or more generative CFGs, the result may be ambiguous even if the constituent grammars are unambiguous. There is no algorithm that can decide whether a CFG is ambiguous, so there can't even be a warning or error when a composition yields an ambiguous result. So generative grammars could not be used for Raku's built-in grammar construct.
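A standard textbook illustration of that composition problem (my own addition, not from the original answer): each of the two small grammars below is unambiguous on its own, yet combining them gives the string aabbcc two distinct derivations, and because the combined language { a^i b^i c^j } ∪ { a^i b^j c^j } is inherently ambiguous, no grammar for it can avoid this.

S  → S1 | S2
S1 → X C     X → a X b | a b     C → c C | c        (generates a^i b^i c^j)
S2 → A Y     A → a A | a         Y → b Y c | b c    (generates a^i b^j c^j)

S ⇒ S1 ⇒ X C ⇒ a X b C ⇒ a a b b C ⇒ a a b b c C ⇒ aabbcc
S ⇒ S2 ⇒ A Y ⇒ a A Y ⇒ a a Y ⇒ a a b Y c ⇒ aabbcc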
To be clear, grammar composition isn't just a very-nice-to-have but something that's increasingly essential in an increasingly polyglot world. Folk want to be able to do things like embed SQL in general-purpose language (GPL) code. Now what? If the SQL is in a string you're subject to injection attacks. Larry Wall decided there had to be a better way, a principled way: composable grammars.
cf my comments in the reddit thread Ergonomic inline SQL as a Python library. Two key excerpts:
(One key thing I forgot to mention and that might not be obvious is that Raku slangs are fully integrated at compile time. So there can be compile-time syntax, type, etc. errors, and full defense against injection attacks, which is a lot harder if the DSL (eg SQL) is in strings.)
Wow, I had no idea that such a language "braid" idea existed. Thanks for sharing that little bit of Raku. The "sql insert into ...; with" line is especially cool, since you literally are writing multiple languages in each other with very little syntax noise.
¹ The token declarator implies ratcheting:
Ratcheting can be an optimization, because backtracking is costly. But more importantly, it closely corresponds to how humans parse a text.
The doc fails to mention another important aspect: elimination of (the risk of) catastrophic backtracking.
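A rough sketch of that risk (my own illustration, not from the documentation), using the classic nested-quantifier pattern:

my $haystack = 'a' x 30;   # contains no 'b', so any match must ultimately fail

# With a plain backtracking regex the engine may try a combinatorial number of
# ways to split the a's between the two quantifiers before giving up.
# Uncomment with care:
# say so $haystack ~~ m/ [ 'a'+ ]+ 'b' /;

# With :ratchet -- which is what the token declarator implies -- nothing is ever
# given back, so the failure is reported after a quick scan:
say so $haystack ~~ m/ :ratchet [ 'a'+ ]+ 'b' /;   # False, and fast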
Upvotes: 11