Handling Whitespace in the Regexp::Grammars Module

Question

I have a grammar that I'm trying to parse with the help of Regexp::Grammars but for some reason it looks like it's having a whitespace problem. I've managed to reduce it to the following:

use Modern::Perl;
use v5.16;

use Regexp::Grammars;
use Data::Dumper;

my $grammar = qr{
    <foo> <baz> | my <foo> is <baz>

    <token: foo> foo | fu | phoo
    <token: baz> bazz?
}ix;

while (<>) {
    chomp;

    if (/$grammar/) {
        say Dumper(\%/);
    }
    else {
        say "NO MATCH!!\n";
    }

}

When the program is run and any matching sequence such as

foo baz
phoo bazz
my fu is baz

is entered the program returns

NO MATCH!!

However, if I insert a debugging directive before the grammar definition:

<debug: on>
<foo> <baz> | my <foo> is <baz>
...

I get what I expect:

perl.exe : ========> Trying <grammar> from position 0
At line:1 char:5
+ perl <<<<  .\test_grammar2.pl 2>&1 > output.txt
    + CategoryInfo          : NotSpecified: (========> Tryin...from position 0:String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError

phoo bazz |...Trying <foo>
          |   |...Trying subpattern /foo/
          |   |    \FAIL subpattern /foo/
          |   |...Trying next alternative
          |   |...Trying subpattern /fu/
          |   |    \FAIL subpattern /fu/
          |   |...Trying next alternative
          |   |...Trying subpattern /phoo/
 bazz     |   |    \_____subpattern /phoo/ matched 'phoo'
          |    \_____<foo> matched 'phoo'
          |...Trying <baz>
          |   |...Trying subpattern /bazz?/
[eos]     |   |    \_____subpattern /bazz?/ matched 'bazz'
          |    \_____<baz> matched ' bazz'
           \_____<grammar> matched 'phoo bazz'

$VAR1 = {
          '' => 'phoo baz',
          baz => ' bazz',
          foo => 'phoo'
        };

Similarly, if I put an optional whitespace sequence between the subrule and literal calls:

<foo>\s*<baz> ...
...

I also get a match.

I'm using Windows 7, ActivePerl Build 1603, Perl 5.16.3 and PowerShell. I've tried using cmd.exe as well, just in case there was some obscure PowerShell issue, but I have the same problem. I've also tried matching directly:

my $s = q(fu baz);
if ($s =~ $grammar) {
    ...
}

but I get the same problem--with the same solution.

EDIT: What I've learned.

When using the Regexp::Grammars module, if your grammar requires spaces between literals, subrules, or both, then you need to either encapsulate:

<top>

<rule: top> <foo> <baz> | my <foo> is <baz>

escape:

<foo>\ <baz> | my\ <foo>\ is\ <baz>

or insert whitespace sequences:

<foo>\s+<baz> | my\s+<foo>\s+is\s+<baz>

Mark Nodine · Accepted Answer

Okay, I figured out what the issue was. The top level match in a Regexp::Grammars expression is treated in token mode (whitespace not ignored) rather than in rule mode (whitespace ignored). So, to get what you want, you only need to add a top rule:

my $grammar = qr{
    <top>

    <rule: top> <foo> <baz>
           |
                    my <foo> is <baz>
    <token: foo> foo | fu | phoo
    <token: baz> bazz?
}ix;

Here's my complete program:

use Modern::Perl;
use v5.16;

use Regexp::Grammars;
use Data::Dumper;

my $grammar = qr{
    <top>

    <rule: top> <foo> <baz>
           |
                    my <foo> is <baz>
    <token: foo> foo | fu | phoo
    <token: baz> bazz?
}ix;

1;
while (<>) {
    chomp;

    if (/$grammar/) {
        say Dumper(\%/);
    }
    else {
        say "NO MATCH!!\n";
    }

}

Here's my output:

% echo FU baz | perl grammar.pl
$VAR1 = {
          '' => 'FU baz',
          'top' => {
                     '' => 'FU baz',
                     'baz' => 'baz',
                     'foo' => 'FU'
                   }
        };

% echo my phoo is bazz | perl grammar.pl
$VAR1 = {
          '' => 'my phoo is bazz',
          'top' => {
                     '' => 'my phoo is bazz',
                     'baz' => 'bazz',
                     'foo' => 'phoo'
                   }
        };

The documentation for Regexp::Grammars specifically states that the top level is done in token mode. Adding a top-level rule only adds one layer to the parse tree, but I don't think you have a choice if whitespace is to be ignored at the top level.
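
The token-versus-rule distinction can be illustrated with plain Perl regexes, without the module. This is only a rough sketch: the names $token_like and $rule_like are made up for the illustration, and the \s* is an approximation of the whitespace matching that a rule supplies, not the module's actual expansion. Under the /x flag, unescaped whitespace in a pattern is ignored, so a pattern written with a space between two parts still requires those parts to be adjacent in the input.

```perl
use strict;
use warnings;

# Plain-regex stand-ins (hypothetical names); Regexp::Grammars is not used.
# Under /x the space between the two groups is discarded, so the words
# must be adjacent. This mirrors the top level being matched in token mode.
my $token_like = qr{ (?: foo | fu | phoo ) (?: bazz? ) }ix;

# Explicit optional whitespace between the groups, which is approximately
# what a rule arranges wherever its body contains whitespace.
my $rule_like = qr{ (?: foo | fu | phoo ) \s* (?: bazz? ) }ix;

for my $input ('phoo bazz', 'phoobazz') {
    my $token = $input =~ $token_like ? 'match' : 'no match';
    my $rule  = $input =~ $rule_like  ? 'match' : 'no match';
    printf "%-10s token-like: %-9s rule-like: %s\n", $input, $token, $rule;
}
```

The token-like pattern rejects 'phoo bazz' because nothing in it can consume the space, which is the same symptom as the original grammar. The rule-like pattern accepts both inputs.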
