Erik Olson
Erik Olson

Reputation: 195

Optimizing a regex filled with '?'

On the stenographic keyboard, there are the keys STKPWHRAO*EUFRPBLGTSDZ. The user presses several keys, then the keys are registered all at once when lifted. It's similar to playing chords on a piano. Example strokes are KAT, TPHOEUGT.

I have a regex which tests for valid steno chords. It can be any number of these keys but they must be in that order. My solution is qr/S?T?K?P?W?H?R?A?O?\*?E?U?F?R?P?B?L?G?T?S?D?Z?/ but since this regex gets called hundreds of times, the variable length might be a speed bottleneck. Each step forward in the regex is a bigger and bigger set of possibilities due to all the ?

Is there a faster regex approach to this? I need the regex to fail if keys are out of order.

Upvotes: 3

Views: 146

Answers (2)

ikegami
ikegami

Reputation: 386321

To check if a string is a valid chord, you'd actually need

/^(?=.)S?T?K?P?W?H?R?A?O?\*?E?U?F?R?P?B?L?G?T?S?D?Z?\z/s

A simple optimization would be to make sure a match is possible.

/^(?=[STKPWHRAO*EUFBLGDZ])S?T?K?P?W?H?R?A?O?\*?E?U?F?R?P?B?L?G?T?S?D?Z?\z/s

The next step is to eliminate backtracking. That's where time is being lost.

/
   ^
   (?=[STKPWHRAO*EUFBLGDZ])
   S?+ T?+ K?+ P?+ W?+ H?+ R?+ A?+ O?+ \*?+ E?+
   U?+ F?+ R?+ P?+ B?+ L?+ G?+ T?+ S?+  D?+ Z?+
   \z
/x

Fortunately, even though S, T, P and R appear twice, backtracking could be completely eliminated without trouble. This should virtually the matching time to virtually nothing.

If even that isn't fast enough, the next step is writing a specialized C function. Starting the regex matching engine is expensive, and completely avoidable with a simple function.

Note that the above optimizations only help when the pattern doesn't match. They should be neutral when the pattern matches. The C function, on the other hand, would help even when then pattern matches.

Benchmarks:

use strict;
use warnings;
use feature qw( say );

use Benchmark qw( cmpthese );

my %tests = (
   orig => q{ $s =~ /^(?=.)S?T?K?P?W?H?R?A?O?\*?E?U?F?R?P?B?L?G?T?S?D?Z?\z/s},
   new  => q{ $s =~
      /
         ^
         (?=[STKPWHRAO*EUFBLGDZ])
         S?+ T?+ K?+ P?+ W?+ H?+ R?+ A?+ O?+ \*?+ E?+
         U?+ F?+ R?+ P?+ B?+ L?+ G?+ T?+ S?+  D?+ Z?+
         \z
      /x
   },
);

$_ = 'use strict; use warnings; our $s; ' . $_
   for values %tests;

{ say "Matching:";     local our $s = "STAODZ";   cmpthese(-3, \%tests); }
{ say "Not matching:"; local our $s = "STPRSTPR"; cmpthese(-3, \%tests); }

Output:

Matching:
         Rate  new orig
new  509020/s   -- -29%
orig 712274/s  40%   --
Not matching:
         Rate orig  new
orig 158758/s   -- -73%
new  579851/s 265%   --

Which means
matching slowed from 1.40μs to 1.96μs (in this case), and
non-matching speed up from 6.30μs to 1.72μs (in this case).


To check if a string is a sequence of valid chords, you'd simply need

/^[STKPWHRAO*EUFBLGDZ]+\z/

If you want to extract all the chords in a string, I'd start by extracting the sequences matched by the following, then finding the chords within the extracted sequences:

/([STKPWHRAO*EUFBLGDZ]+)/

Upvotes: 5

Borodin
Borodin

Reputation: 126742

the variable length might be a speed bottleneck

You shouldn't work like that

  • First, write and debug your program

  • then, if it isn't fast enough for it's purpose, profile your program to find where the bottlenecks are

  • then optimise the bottlenecks

For goodness sake don't spend ages trying to guess where the bottlenecks are and optimising them before your code is complete, as you will more than likely find that you have guessed wrongly and wasted a lot of time

In any case, the regex engine is written in C and is pretty damn fast. I doubt very much whether the short pattern that you have written will take a significant amount of time to test

Each step forward in the regex is a bigger and bigger set of possibilities due to all the ?

That isn't true either. At each point in the regex there is only one character to test. The next character in the string either matches it or it doesn't. Either is fine, and the regex engine just goes on to the next step in the pattern. The matching process will be pretty much constant regardless of the string to be matched.

Upvotes: 0

Related Questions