perl6 Regex subrules and named regex MUCH MUCH slower than explicit regex; how to make them equally fast?

Question

I have a data file with 1608240 lines. The file is divided into sections. Each section has a unique word in its first line, and every section has the same word, "doneSection", in its last line.

I am trying to fish out some sections by doing the following (code reformatted by @raiph from original post, to make code easier to interpret):

# using named subrules/regex is EXTREMELY slow;
# it reads about 2 lines per second, and grinds to a halt
# after about 500 lines: (>> is the right word boundary)
perl6 -e 'my regex a { [ <{<iron copper carbon>.join("||")}> ] };
          my $x = 0;
          for "/tmp/DataRaw".IO.lines {
            $*ERR.print( "$x 1608240 \r" );
            ++$x;
            .say if m/:i beginSection \s+ <a> >>/ or
                    (m/:i \s+ <a> \s+ /
                     ff
                     m/:i doneSection/);
          }'

# however, if I explicitly write out the regex instead of using a subrule,
# it reads about 1000 lines per second, and it gets the job done:
perl6 -e 'my $x = 0;
          for "/tmp/DataRaw".IO.lines {
            $*ERR.print( "$x 1608240 \r" );
            ++$x;
            .say if m/:i beginSection \s+
                         [ iron || copper || carbon ] >>/ or
                    (m/:i \s+
                         [ iron || copper || carbon ] \s+ /
                     ff
                     m/:i doneSection/);
          }'

My question is, how to make subrule as fast as explicit regex, or at least keep it from grinding to a halt? I prefer using a higher level of abstraction. Is this a regex engine memory problem? I have also tried using:

my $a=rx/ [ <{ < iron copper carbon > .join("||") }> ] /

and it is equally slow.

I cannot post the 1.6 million lines of my data file, but you can probably generate a similar file for testing purposes.

Thanks for any hints.

raiph · Accepted Answer

The problem isn't use of subrules / naming regexes. It's what's inside the regex. It's:

[ <{<iron copper carbon>.join("||")}> ]

vs

[ iron || copper || carbon ]

The following should eliminate the speed difference. Please try it and comment on your results:

my regex a { || < iron copper carbon > }

Note the leading whitespace in < iron copper ... > rather than <iron copper ...>. The latter means a subrule called iron with the arguments copper etc. The former means a "quotewords" list literal, just as it does in the main language (though the leading whitespace is optional in the main language).¹
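As a rough check that the compiled form behaves like the hand-written alternation, here is a minimal sketch. The regex name `metal`, the helper `is-metal-line`, and the two sample lines are made up for illustration; they are not from the question's data.

```raku
# Hypothetical name: `metal` stands in for the question's regex `a`.
my regex metal { || < iron copper carbon > }

# Hypothetical helper: True when a line opens one of the wanted sections.
sub is-metal-line(Str $line --> Bool) {
    so $line ~~ m/:i beginSection \s+ <metal> >>/
}

for 'beginSection iron', 'beginSection zinc' -> $line {
    say "$line => ", is-metal-line($line);
}
```

The first line should report True and the second False. Note that the `:i` adverb in the outer regex is lexical, so it presumably does not make the `metal` subrule itself case-insensitive.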

The list of matchers can be put in an array variable:

my @matchers = < iron copper carbon >;
my regex a { || @matchers }

The matchers in @matchers can be arbitrary regexes not just strings:

my @matchers = / i..n /, / cop+er /, / carbon /;
my regex a { || @matchers }

Warning: The above works, but while writing this answer I encountered, and have now golfed, an issue: @-sigiled array interpolation doesn't backtrack.

how to make subrule as fast as explicit regex

It's not about it being explicit. It's about regex interpolation that involves run-time evaluation.

In general, P6 regexes are written in their own regex language¹ that is compiled at compile-time by default.

But the P6 regex language includes the ability to inject code that is then evaluated at run-time (provided it's not dangerous).²

This can be useful but incurs run-time overhead which can sometimes be significant.

(It's also possible you've got some bad Big O algorithmic performance going on related to your use of the run-time evaluation. If so, it becomes even worse than just run-time interpolation, because it's then a Big O problem. I've not bothered to analyze that because it's best just to use fully compiled regexes as per my code above.)
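If you want to measure the overhead yourself, something along these lines might do. It is only a sketch: the sample lines, the regex names `compiled` and `interpolated`, and the helper `timed-count` are invented for the example, and the timings will depend on your Rakudo version and machine.

```raku
# Hypothetical sample data: 200 lines, half of which open a wanted section.
my @lines = (^200).map: { $_ %% 2 ?? 'beginSection iron' !! 'beginSection zinc' };

# Compiled once, at compile time.
my regex compiled     { || < iron copper carbon > }
# Builds the alternation from a string at run time.
my regex interpolated { <{ < iron copper carbon >.join("||") }> }

# Hypothetical helper: count matching lines and report elapsed seconds.
sub timed-count(@lines, Regex $rx) {
    my $start = now;
    my $hits  = @lines.grep($rx).elems;
    return $hits, now - $start;
}

my ($fast-hits, $fast-secs) = timed-count(@lines, rx/ beginSection \s+ <compiled> >> /);
my ($slow-hits, $slow-secs) = timed-count(@lines, rx/ beginSection \s+ <interpolated> >> /);

say "compiled:     $fast-hits hits in $fast-secs s";
say "interpolated: $slow-hits hits in $slow-secs s";
```

Both variants should find the same lines; only the elapsed time should differ. The sample is kept small on purpose, since the question reports the interpolated form slowing to a crawl.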

I have also tried using:

my $a=rx/ [ <{ < iron copper carbon > .join("||") }> ] /

That still doesn't avoid run-time interpolation. This construct:

<{ ...  }>

interpolates by evaluating the code inside the braces at run-time and then injecting that into the outer regex.
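One way to see that run-time evaluation happen is to count how often the code inside the braces runs. This is a sketch under assumptions: the counter `$evaluations`, the regex name `counted`, and the three test words are made up for the illustration.

```raku
# Counter bumped every time the code inside <{ ... }> is evaluated.
my $evaluations = 0;

# Hypothetical regex: the block runs at match time and returns regex source text.
my regex counted { <{ $evaluations++; 'iron||copper||carbon' }> }

# Match three words; each attempt has to evaluate the block again.
my @hits = < iron copper zinc >.grep: / ^ <counted> $ /;

say "matched: @hits[]";
say "block evaluated $evaluations times";
```

The counter should end up at least as large as the number of words tried, including the one that fails to match, which is the per-match cost a fully compiled regex avoids.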

Footnotes

¹ The P6 "language" is actually an interwoven collection of DSLs.

² Unless you explicitly write a use MONKEY-SEE-NO-EVAL; (or just use MONKEY;) pragma to take responsibility for injection attacks, the interpolation of a regex containing injected strings is limited at compile-time to ensure injection attacks aren't possible, and P6 will refuse to run the code if they are. The code you've written isn't subject to attacks, so the compiler let you write it as you have done and compiled the code without fuss.
