Håkon Hægland
Håkon Hægland

Reputation: 40778

Backtracking in regexp faster than expected

According to perlre, the following code should take several seconds to execute:

$ time perl -E '$_="((()" . ("a") x 18;  say "ok" if m{ \(([^()]+|\( [^()]* \))+\)}x;'

real    0m0.006s
user    0m0.000s
sys 0m0.005s

The documentation says:

Consider how the pattern above detects no-match on ((()aaaaaaaaaaaaaaaaaa in several seconds, but that each extra letter doubles this time.

As seen, it only takes a fraction of a second on my laptop.. Even running with a million a's completes in half a second:

$ time perl -E '$_="((()" . ("a") x 1000000;  say "ok" if m{ \(([^()]+|\( [^()]* \))+\)}x;'

real    0m0.454s
user    0m0.446s
sys 0m0.008s

What am I missing here?

Upvotes: 8

Views: 1808

Answers (2)

David Hammen
David Hammen

Reputation: 33126

Consider how the pattern above detects no-match on ((()aaaaaaaaaaaaaaaaaa in several seconds.

This sentence dates back to at least April 2001, when perl 5.6.1 was released. You can see the perlre man page for that release at search.cpan.org/~gsar/perl-5.6.1/pod/perlre.pod.

This is perhaps a lesson to be learned here in documenting software. Be careful what you write as your written words will remain unchanged despite years of improvements and multiple forks by others.

Upvotes: 2

Sobrique
Sobrique

Reputation: 53498

One of the tricks you can do to figure out what the regex engine is doing, is:

use re 'debug'; 

E.g.:

use strict;
use warnings;
use re 'debug'; 

my $str = "a" x 18;

$str =~ m{ \(([^()]+|\( [^()]* \))+\)}x;

This will then print what the regex engine is actually doing:

Compiling REx " \(([^()]+|\( [^()]* \))+\)"
Final program:
   1: EXACT <(> (3)
   3: CURLYX[0] {1,32767} (40)
   5:   OPEN1 (7)
   7:     BRANCH (20)
   8:       PLUS (37)
   9:         ANYOF[^()][{above_bitmap_all}] (0)
  20:     BRANCH (FAIL)
  21:       EXACT <(> (23)
  23:       STAR (35)
  24:         ANYOF[^()][{above_bitmap_all}] (0)
  35:       EXACT <)> (37)
  37:   CLOSE1 (39)
  39: WHILEM[1/3] (0)
  40: NOTHING (41)
  41: EXACT <)> (43)
  43: END (0)
anchored "(" at 0 floating ")" at 2..2147483647 (checking floating) minlen 3 
Matching REx " \(([^()]+|\( [^()]* \))+\)" against "aaaaaaaaaaaaaaaaaa"
Intuit: trying to determine minimum start position...
  doing 'check' fbm scan, [2..18] gave -1
  Did not find floating substr ")"...
Match rejected by optimizer
Freeing REx: " \(([^()]+|\( [^()]* \))+\)"

If you add your brackets back in, you get a different result - I get around 2000 steps to process the regex. This gets longer with each additional letter - about 300 steps.

So I would say yes - catastrophic backtracking is occurring, but you may well find that processors (and regex engine optimisations) mean the time is considerably shorter.

use strict;
use warnings;
use re 'debug'; 

my $str = "((()"."a" x 100_000;
$str =~ m{ \(([^()]+|\( [^()]* \))+\)}x;

Runs considerably longer - but at least a proportion of it is because printing text to the screen is actually quite 'expensive' comparatively.

My test suggests (number of "a"s)

10 : 1221 lines of output (broadly correlated with regex steps)
11 : 1324 (+103)
12 : 1467 (+143)
13 : 1590 (+129)
14 : 1728 (+138)
15 : 1852 (+124)

20 : 2630 (approx +155/extra)
30 : 4536 (+190/extra)
40 : 6940 (+240/extra)
50 : 9846 (+290/extra)

100 - 31,846 (+440/extra letter)

So certainly looks like exponential behaviour going on - but in neither case, the processing time is still pretty quick, which I'd attribute to faster cpus (and maybe better optimisation of the regex engine)

Upvotes: 6

Related Questions