Hugo Zaragoza
Hugo Zaragoza

Reputation: 601

Poor performance / speed of regex with lookahead

I have been observing extremely slow execution times with expressions with several lookaheads.

I suppose that this is due to underlying data structures, but it seems pretty extreme and I wonder if I do something wrong or if there are known work-arounds.

The problem is determining if a set of words are present in a string, in any order. For example we want to find out if two terms "term1" AND "term2" are somewhere in a string. I do this with the expresion:

(?=.*\bterm1\b)(?=.*\bterm2\b)

But what I observe is that this is an order of magnitude slower than checking first just

\bterm1\b

and just then

\bterm2\b

This seems to indicate that I should use an array of patterns instead of a single pattern with lookaheads... is this right? it seems wrong...

Here is an example test code and resulting times:

public static void speedLookAhead() {
    Matcher m, m1, m2;
    boolean find;
    int its = 1000000;

    // create long non-matching string
    char[] str = new char[2000];
    for (int i = 0; i < str.length; i++) {
      str[i] = 'x';
    }
    String test = str.toString();

    // First method: use one expression with lookaheads
    m = Pattern.compile("(?=.*\\bterm1\\b)(?=.*\\bterm2\\b)").matcher(test);
    long time = System.currentTimeMillis();
    ;
    for (int i = 0; i < its; i++) {
      m.reset(test);
      find = m.find();
    }
    time = System.currentTimeMillis() - time;
    System.out.println(time);

    // Second method: use two expressions and AND the results
    m1 = Pattern.compile("\\bterm1\\b").matcher(test);
    m2 = Pattern.compile("\\bterm2\\b").matcher(test);
    time = System.currentTimeMillis();
    ;
    for (int i = 0; i < its; i++) {
      m1.reset(test);
      m2.reset(test);
      find = m1.find() && m2.find();
    }
    time = System.currentTimeMillis() - time;
    System.out.println(time);
  } 

This outputs in my computer:

1754
150

Upvotes: 2

Views: 3886

Answers (3)

user557597
user557597

Reputation:

This might shave some time

greedy
([AB]).*(?!\1)[AB]

non-greedy
([AB]).*?(?!\1)[AB]

redo

I've made my own benches on this issue. Matching a single term at a time like /term/
as opposed to two terms in one regex will always take less time because it does no
backtracking. Its as easy as strncmp(term). And doing 2 terms seperately will then be
much faster.

If you can define the terms such that there is no possiblity of overlap, then this is
the way to go. ie; /term1/ && /term2/.

There is no way to combine terms into a single regex without invoking backtracking.

That is, if you actually care about overlap, then there are techniques to minimize
backtracking.

/(?=.*A)(?=.*B)/ is just like /A/ && /B/ except it seems a whole lot slower, neither account for overlap.

So, if you actually care about overlap (and I would strongly recommend you do), there is
are two ways that could be combined for maximum efficiency.

/(A|B) .* (?!\1)(?:A|B)/

or

/A/ && /B/ && /(A|B) .* (?!\1)(?:A|B)/

This last one will add a small (relative) overhead, but can inhibit access in the logic
chain, requiring A and B to at least exist, before checking for overlap.

And, depending on where A and B exist in the string, /(A|B) .* (?!\1)(?:A|B)/
could also take time, but its still the shortest way that there is when everything
averages out.

Below is a Perl program that benchmarks some sample (possible scenarios) strings.

Good luck!

use strict;
use warnings;

use Benchmark ':hireswallclock';
my ($t0,$t1);

my ($term1, $term2) = ('term','m2a');
my  @samples = (
   ' xaaaaaaa term2ater  ',
   ' xaaaaaaa term2aterm ',
   ' xaaaaaaa ter2ater  ',
   ' Aaa term2ater ' . 'x 'x100 . 'xaaaaaaa mta ',
   ' Baa term      ' . 'x 'x100 . 'xaaaaaaa mta ',
   ' Caa m2a       ' . 'x 'x100 . 'xaaaaaaa term ',
   ' Daa term2a       ' . 'x 'x100 . 'xaaaaaaa term ',
);

my $rxA  = qr/$term1/;
my $rxB  = qr/$term2/;
my $rxAB = qr/ ($term1|$term2) .* (?!\1)(?:$term1|$term2) /x;


for (@samples)
{
    printf "Checking string:  '%.40s'\n-------------\n", $_;

    if (/$term1/ && /$term2/ ) {
       print "  Found possible candidates (A && B)\n";
    }
    if (/ ($term1|$term2) .* ((?!\1)(?:$term1|$term2)) /x) {
       print "  Found non-overlaped terms: '$1'  '$2'\n";
    }
    else {
       print "  No (A|B) .* (?!\\1)(A|B) terms found!\n";
    }
    print "\n   Bench\n";

    $t0 = new Benchmark;
    for my $cnt (1 .. 500_000) {
       /$rxA/  &&  /$rxB/;
    }
    $t1 = new Benchmark;
    print "    $rxA && $rxB\n    -took: ", timestr(timediff($t1, $t0)), "\n\n";

    $t0 = new Benchmark;
    for my $cnt (1 .. 500_000) {
       /$rxAB/;
    }
    $t1 = new Benchmark;
    print "    $rxAB\n    -took: ", timestr(timediff($t1, $t0)), "\n\n";

    $t0 = new Benchmark;
    for my $cnt (1 .. 500_000) {
       /$rxA/  &&  /$rxB/ && /$rxAB/;
    }
    $t1 = new Benchmark;
    print "    $rxA && $rxB &&\n    $rxAB\n    -took: ", timestr(timediff($t1, $t0)), "\n\n";

}

Output

Checking string:  ' xaaaaaaa term2ater  '
-------------
  Found possible candidates (A && B)
  No (A|B) .* (?!\1)(A|B) terms found!

   Bench
    (?-xism:term) && (?-xism:m2a)
    -took: 1.46875 wallclock secs ( 1.47 usr +  0.00 sys =  1.47 CPU)

    (?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
    -took: 3.3748 wallclock secs ( 3.34 usr +  0.00 sys =  3.34 CPU)

    (?-xism:term) && (?-xism:m2a) &&
    (?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
    -took: 5.0623 wallclock secs ( 5.06 usr +  0.00 sys =  5.06 CPU)

Checking string:  ' xaaaaaaa term2aterm '
-------------
  Found possible candidates (A && B)
  Found non-overlaped terms: 'm2a'  'term'

   Bench
    (?-xism:term) && (?-xism:m2a)
    -took: 1.48403 wallclock secs ( 1.49 usr +  0.00 sys =  1.49 CPU)

    (?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
    -took: 3.89044 wallclock secs ( 3.89 usr +  0.00 sys =  3.89 CPU)

    (?-xism:term) && (?-xism:m2a) &&
    (?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
    -took: 5.40607 wallclock secs ( 5.38 usr +  0.00 sys =  5.38 CPU)

Checking string:  ' xaaaaaaa ter2ater  '
-------------
  No (A|B) .* (?!\1)(A|B) terms found!

   Bench
    (?-xism:term) && (?-xism:m2a)
    -took: 0.765321 wallclock secs ( 0.77 usr +  0.00 sys =  0.77 CPU)

    (?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
    -took: 1.29674 wallclock secs ( 1.30 usr +  0.00 sys =  1.30 CPU)

    (?-xism:term) && (?-xism:m2a) &&
    (?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
    -took: 0.874842 wallclock secs ( 0.88 usr +  0.00 sys =  0.88 CPU)

Checking string:  ' Aaa term2ater x x x x x x x x x x x x x'
-------------
  Found possible candidates (A && B)
  No (A|B) .* (?!\1)(A|B) terms found!

   Bench
    (?-xism:term) && (?-xism:m2a)
    -took: 1.46842 wallclock secs ( 1.47 usr +  0.00 sys =  1.47 CPU)

    (?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
    -took: 28.078 wallclock secs (28.08 usr +  0.00 sys = 28.08 CPU)

    (?-xism:term) && (?-xism:m2a) &&
    (?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
    -took: 29.4531 wallclock secs (29.45 usr +  0.00 sys = 29.45 CPU)

Checking string:  ' Baa term      x x x x x x x x x x x x x'
-------------
  No (A|B) .* (?!\1)(A|B) terms found!

   Bench
    (?-xism:term) && (?-xism:m2a)
    -took: 1.68716 wallclock secs ( 1.69 usr +  0.00 sys =  1.69 CPU)

    (?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
    -took: 15.1563 wallclock secs (15.16 usr +  0.00 sys = 15.16 CPU)

    (?-xism:term) && (?-xism:m2a) &&
    (?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
    -took: 1.64033 wallclock secs ( 1.64 usr +  0.00 sys =  1.64 CPU)

Checking string:  ' Caa m2a       x x x x x x x x x x x x x'
-------------
  Found possible candidates (A && B)
  Found non-overlaped terms: 'm2a'  'term'

   Bench
    (?-xism:term) && (?-xism:m2a)
    -took: 1.62448 wallclock secs ( 1.63 usr +  0.00 sys =  1.63 CPU)

    (?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
    -took: 3.0154 wallclock secs ( 3.02 usr +  0.00 sys =  3.02 CPU)

    (?-xism:term) && (?-xism:m2a) &&
    (?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
    -took: 4.56226 wallclock secs ( 4.56 usr +  0.00 sys =  4.56 CPU)

Checking string:  ' Daa term2a       x x x x x x x x x x x '
-------------
  Found possible candidates (A && B)
  Found non-overlaped terms: 'm2a'  'term'

   Bench
    (?-xism:term) && (?-xism:m2a)
    -took: 1.45252 wallclock secs ( 1.45 usr +  0.00 sys =  1.45 CPU)

    (?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
    -took: 16.1404 wallclock secs (16.14 usr +  0.00 sys = 16.14 CPU)

    (?-xism:term) && (?-xism:m2a) &&
    (?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
    -took: 17.6719 wallclock secs (17.67 usr +  0.00 sys = 17.67 CPU)

Upvotes: 2

Peter Lawrey
Peter Lawrey

Reputation: 533520

You need to place each loop in a separate method, if swap the order of the tests you will get different results.

Can you compare this with test.indexOf('A') >= 0 && test.indexOf('B') >= 0 as I imagine this could be much faster?

Upvotes: 1

buckley
buckley

Reputation: 14089

The regex you posted

(?=.\A\b)(?=.\B\b)

Does not match the one within the code

.(?=.*B)(?=.*A)

In fact the first regex is something that can't possibly match it seems.

Can you give some sample input of things that should match and which that don't.

This is the regex from the code explained.

Match any single character that is not a line break character «.»
Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=.*B)»
   Match any single character that is not a line break character «.*»
      Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
   Match the character “B” literally «B»
Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=.*A)»
   Match any single character that is not a line break character «.*»
      Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
   Match the character “A” literally «A»

Upvotes: 1

Related Questions