Reputation: 1916
I'm trying to parse input with a grammar written for Regexp::Grammars, but it appears to have a whitespace problem. I've reduced it to the following:
use Modern::Perl;
use v5.16;
use Regexp::Grammars;
use Data::Dumper;
my $grammar = qr{
<foo> <baz> | my <foo> is <baz>
<rule: foo> foo | fu | phoo
<rule: baz> bazz?
}ix;
while (<>) {
    chomp;
    if (/$grammar/) {
        say Dumper(\%/);
    }
    else {
        say "NO MATCH!!\n";
    }
}
When the program is run and any matching sequence such as
foo baz
phoo bazz
my fu is baz
is entered the program returns
NO MATCH!!
However, if I insert a debugging directive before the grammar definition:
<debug: match>
<foo> <baz> | my <foo> is <baz>
...
I get what I expect:
perl.exe : ========> Trying <grammar> from position 0
At line:1 char:5
+ perl <<<< .\test_grammar2.pl 2>&1 > output.txt
+ CategoryInfo : NotSpecified: (========> Tryin...from position 0:String) [], RemoteException
+ FullyQualifiedErrorId : NativeCommandError
phoo bazz |...Trying <foo>
| |...Trying subpattern /foo/
| | \FAIL subpattern /foo/
| |...Trying next alternative
| |...Trying subpattern /fu/
| | \FAIL subpattern /fu/
| |...Trying next alternative
| |...Trying subpattern /phoo/
bazz | | \_____subpattern /phoo/ matched 'phoo'
| \_____<foo> matched 'phoo'
|...Trying <baz>
| |...Trying subpattern /bazz?/
[eos] | | \_____subpattern /bazz?/ matched 'bazz'
| \_____<baz> matched ' bazz'
\_____<grammar> matched 'phoo bazz'
$VAR1 = {
'' => 'phoo baz',
baz => ' bazz',
foo => 'phoo'
};
Similarly, if I put an optional whitespace sequence between the subrule and literal calls:
<foo>\s*<baz> ...
...
I also get a match.
I'm using Windows 7, ActivePerl Build 1603, Perl 5.16.3 and PowerShell. I've also tried cmd.exe in case there was some obscure PowerShell issue, but I have the same problem there. I've also tried matching directly:
my $s = q(fu baz);
if ($s =~ $grammar) {
...
}
but I get the same problem, with the same workarounds.
EDIT: What I've learned.
When using the Regexp::Grammars module, if your grammar requires spaces between literals, subrules or both at the top level, then you need to either encapsulate the top-level pattern in a rule:
<foobaz>
<rule: foobaz> <foo> <baz> | my <foo> is <baz>
escape:
<foo>\ <baz> | my\ <foo>\ is\ <baz>
or insert whitespace sequences:
<foo>\s+<baz> | my\s+<foo>\s+is\s+<baz>
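The difference between the escape and whitespace-sequence workarounds can be seen with ordinary Perl regexes. The following is a minimal sketch that does not use Regexp::Grammars at all; it only assumes the top-level pattern behaves like a normal /x regex. The %pattern hash and the matches helper are hypothetical names introduced for illustration.

```perl
use strict;
use warnings;

# Plain Perl regexes, no Regexp::Grammars: under /x a bare space in the
# pattern is ignored, so a required space has to be spelled out explicitly.
my %pattern = (
    bare    => qr{\A fu baz   \z}ix,   # bare space is dropped: same as /fubaz/
    escaped => qr{\A fu\ baz  \z}ix,   # "\ " is exactly one literal space
    spaces  => qr{\A fu\s+baz \z}ix,   # "\s+" is one or more whitespace chars
);

# Hypothetical helper: returns 1 if the named pattern matches, else 0.
sub matches {
    my ($name, $input) = @_;
    return $input =~ $pattern{$name} ? 1 : 0;
}

for my $input ('fu baz', 'fu  baz', "fu\tbaz", 'fubaz') {
    (my $shown = $input) =~ s/\t/\\t/g;    # make the tab visible when printing
    printf "%-12s bare=%d escaped=%d spaces=%d\n", "[${shown}]",
        map { matches($_, $input) } qw(bare escaped spaces);
}
```

The bare pattern only accepts "fubaz", the escaped form accepts exactly one space, and the \s+ form also accepts repeated spaces and tabs, so \s+ is usually the more forgiving choice for typed input.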
Upvotes: 3
Views: 165
Reputation: 163
Okay, I figured out what the issue was. The top level match in a Regexp::Grammars expression is treated in token mode (whitespace not ignored) rather than in rule mode (whitespace ignored). So, to get what you want, you only need to add a top rule:
my $grammar = qr{
<top>
<rule: top> <foo> <baz> |
my <foo> is <baz>
<rule: foo> foo | fu | phoo
<rule: baz> bazz?
}ix;
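To make the token/rule distinction concrete, here is a rough approximation in plain Perl regexes rather than Regexp::Grammars itself. It assumes that token mode behaves like an ordinary /x regex and that rule mode turns each run of whitespace in the pattern into optional whitespace (\s*), which is my understanding of the module's default implicit whitespace subrule. The names $token_like, $rule_like and classify are hypothetical.

```perl
use strict;
use warnings;

# Approximation only: these are hand-written stand-ins, not what the
# module actually generates.

# Token-like: under /x the space between the two parts is simply ignored,
# so the input must contain the two words with nothing between them.
my $token_like = qr{\A (?:foo|fu|phoo) bazz? \z}ix;

# Rule-like: each whitespace run in the pattern becomes optional
# whitespace (\s*), assumed here to mirror the default rule behaviour.
my $rule_like = qr{\A \s* (?:foo|fu|phoo) \s* bazz? \s* \z}ix;

# Hypothetical helper: reports how both stand-ins treat one input string.
sub classify {
    my ($input) = @_;
    return join ' ',
        ($input =~ $token_like ? 'token:match' : 'token:fail'),
        ($input =~ $rule_like  ? 'rule:match'  : 'rule:fail');
}

print "$_ => ", classify($_), "\n" for 'phoo bazz', 'phoobazz', 'FU baz';
```

Under these assumptions, "phoo bazz" and "FU baz" fail the token-like pattern but pass the rule-like one, while "phoobazz" passes both, which is consistent with the top-level pattern needing to be wrapped in a rule.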
Here's my complete program:
use Modern::Perl;
use v5.16;
use Regexp::Grammars;
use Data::Dumper;
my $grammar = qr{
<top>
<rule: top> <foo> <baz> |
my <foo> is <baz>
<rule: foo> foo | fu | phoo
<rule: baz> bazz?
}ix;
1;
while (<>) {
    chomp;
    if (/$grammar/) {
        say Dumper(\%/);
    }
    else {
        say "NO MATCH!!\n";
    }
}
Here's my output:
% echo FU baz | perl grammar.pl
$VAR1 = {
'' => 'FU baz',
'top' => {
'' => 'FU baz',
'baz' => 'baz',
'foo' => 'FU'
}
};
% echo my phoo is bazz | perl grammar.pl
$VAR1 = {
'' => 'my phoo is bazz',
'top' => {
'' => 'my phoo is bazz',
'baz' => 'bazz',
'foo' => 'phoo'
}
};
The documentation for Regexp::Grammars specifically states that the top level is matched in token mode. Adding a top-level rule adds only one layer to the parse tree, and I don't think you have a choice if whitespace is to be ignored at the top level.
Upvotes: 2