gvkv

Reputation: 1916

Handling Whitespace in the Regexp::Grammars Module

I'm trying to parse input with a Regexp::Grammars grammar, but it seems to have a whitespace problem. I've reduced it to the following:

use Modern::Perl;
use v5.16;

use Regexp::Grammars;
use Data::Dumper;

my $grammar = qr{ 
    <foo> <baz> | my <foo> is <baz>

    <rule: foo> foo | fu | phoo
    <rule: baz> bazz?
}ix;

while (<>) {
    chomp;

    if (/$grammar/) {
        say Dumper(\%/);
    }
    else {
        say "NO MATCH!!\n";
    }

}

When the program is run and any matching sequence such as

foo baz
phoo bazz
my fu is baz

is entered the program returns

NO MATCH!!

However, if I insert a debugging directive before the grammar definition:

<debug: match>
<foo> <baz> | my <foo> is <baz>
...

I get what I expect:

perl.exe : ========> Trying <grammar> from position 0
At line:1 char:5
+ perl <<<<  .\test_grammar2.pl 2>&1 > output.txt
    + CategoryInfo          : NotSpecified: (========> Tryin...from position 0:String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError

phoo bazz |...Trying <foo>    

          |   |...Trying subpattern /foo/    
          |   |    \FAIL subpattern /foo/
          |   |...Trying next alternative    
          |   |...Trying subpattern /fu/    
          |   |    \FAIL subpattern /fu/
          |   |...Trying next alternative    
          |   |...Trying subpattern /phoo/    
 bazz     |   |    \_____subpattern /phoo/ matched 'phoo'    
          |    \_____<foo> matched 'phoo'    
          |...Trying <baz>    
          |   |...Trying subpattern /bazz?/    
[eos]     |   |    \_____subpattern /bazz?/ matched 'bazz'    
          |    \_____<baz> matched ' bazz'    
           \_____<grammar> matched 'phoo bazz' 

$VAR1 = {
          '' => 'phoo baz',
          baz => ' bazz',
          foo => 'phoo'
        };

Similarly, if I put an optional whitespace sequence between the subrule calls:

<foo>\s*<baz> ...
...

I also get a match.

I'm using Windows 7, ActivePerl Build 1603, Perl 5.16.3 and PowerShell. I've tried cmd.exe as well, in case there was some obscure PowerShell issue, but I have the same problem. I've also tried matching directly:

my $s = q(fu baz);
if ($s =~ $grammar) {
    ...
}

but I get the same problem--with the same solution.

EDIT: What I've learned.

When using the Regexp::Grammars module, if your grammar requires spaces between literals, subrules, or both, then you need to either encapsulate:

<foobaz>

<rule: foobaz> <foo> <baz> | my <foo> is <baz>

escape:

<foo>\ <baz> | my\ <foo>\ is\ <baz>

or insert whitespace sequences:

<foo>\s+<baz> | my\s+<foo>\s+is\s+<baz>
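The last two forms are not equivalent. As a minimal sketch using ordinary Perl regexes under /x (no Regexp::Grammars involved; the variable names and test strings are made up for illustration), an escaped space matches exactly one space character, while \s+ accepts any run of whitespace:

```perl
use strict;
use warnings;

# Plain /x regexes, used only to compare the two workarounds.
my $escaped  = qr{ \A foo\ baz \z }x;    # "\ " matches exactly one space character
my $flexible = qr{ \A foo\s+baz \z }x;   # "\s+" matches one or more whitespace characters

# Hypothetical inputs: one space, two spaces, and a tab between the words.
my @inputs = (
    [ 'one space',  "foo baz"  ],
    [ 'two spaces', "foo  baz" ],
    [ 'tab',        "foo\tbaz" ],
);

for my $case (@inputs) {
    my ($label, $text) = @$case;
    printf "%-10s  escaped: %-8s  \\s+: %s\n",
        $label,
        ($text =~ $escaped  ? 'match' : 'no match'),
        ($text =~ $flexible ? 'match' : 'no match');
}
```

So the escaped form is the stricter choice, and \s+ is probably what you want if the input may contain tabs or repeated spaces.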

Upvotes: 3

Views: 165

Answers (1)

Mark Nodine

Reputation: 163

Okay, I figured out what the issue was. The top level match in a Regexp::Grammars expression is treated in token mode (whitespace not ignored) rather than in rule mode (whitespace ignored). So, to get what you want, you only need to add a top rule:

my $grammar = qr{
    <top>

    <rule: top>     <foo> <baz> |
                    my <foo> is <baz>
    <rule: foo> foo | fu | phoo
    <rule: baz> bazz?
}ix;

Here's my complete program:

use Modern::Perl;
use v5.16;

use Regexp::Grammars;
use Data::Dumper;

my $grammar = qr{
    <top>

    <rule: top>     <foo> <baz> |
                    my <foo> is <baz>
    <rule: foo> foo | fu | phoo
    <rule: baz> bazz?
}ix;

1;
while (<>) {
    chomp;

    if (/$grammar/) {
        say Dumper(\%/);
    }
    else {
        say "NO MATCH!!\n";
    }

}

Here's my output:

% echo FU baz | perl grammar.pl
$VAR1 = {
          '' => 'FU baz',
          'top' => {
                     '' => 'FU baz',
                     'baz' => 'baz',
                     'foo' => 'FU'
                   }
        };

% echo my phoo is bazz | perl grammar.pl
$VAR1 = {
          '' => 'my phoo is bazz',
          'top' => {
                     '' => 'my phoo is bazz',
                     'baz' => 'bazz',
                     'foo' => 'phoo'
                   }
        };

The documentation for Regexp::Grammars specifically states that the top level is done in token mode. Adding a top-level rule only adds one layer to the parse tree, but I don't think you have a choice if whitespace is to be ignored at the top level.
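To see why token mode produces NO MATCH, it may help to look at what /x does in an ordinary Perl regex. The sketch below is only an analogy: it does not use Regexp::Grammars, the pattern stands in for the top-level `<foo> <baz>`, and it assumes token mode follows the normal /x whitespace rules described above.

```perl
use strict;
use warnings;

# Under /x, unescaped whitespace in the pattern is ignored, so this is
# equivalent to /\Afoobaz\z/: the gap between "foo" and "baz" matches nothing.
my $token_like = qr{ \A foo baz \z }x;

my $joined = "foobaz"  =~ $token_like ? 1 : 0;   # no space in the input
my $spaced = "foo baz" =~ $token_like ? 1 : 0;   # space in the input

print "foobaz:  ", ($joined ? "match" : "no match"), "\n";   # prints "foobaz:  match"
print "foo baz: ", ($spaced ? "match" : "no match"), "\n";   # prints "foo baz: no match"
```

The pattern only matches when the input has no space at all, which is the opposite of what the grammar intends. Inside a rule, that same gap is allowed to match whitespace, hence the extra `<top>` rule.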

Upvotes: 2
