Reputation: 75

Optimizing RegEx Pattern Built Using List of 1000 Phrases

I'm a novice in using RegEx.

I have a list of company phrases (1000+) that I am converting into a regex pattern at runtime.

Here is how I build the pattern:

    ListOfEntries.Sort()

    For i As Integer = 0 To (ListOfEntries.Count - 1)
        ListOfRegExEntries.Add("(\b(?i)" & ListOfEntries(i) & "\b)")
    Next

    RegExPatternString = "(" & String.Join("|", ListOfRegExEntries) & ")"

    RegExPattern = New Regex(RegExPatternString)

The entries are in all uppercase.

The string being matched is full name field. I simply would like to know if string contains a company keyword.

Is there anything I can do to optimize the matching process? If anyone needs more information please ask away!

Upvotes: 0

Answers (3)

Zabalba

Reputation: 75

With some of the other answers/comments it seems RegEx is not the best choice. I decided to use this code instead

Private Function ContainsOrganizationKeywordTest2() As Boolean

    With Output

        Dim BuiltFullName As String = UCase(String.Join(Space, {.PrimaryFirstName, .PrimaryMiddleName, .PrimaryLastName}))
        Dim NameParts As List(Of String) = BuiltFullName.Split(Space).ToList
        NameParts.Sort()

        For i As Integer = 0 To (NameParts.Count - 1)
            If (Not String.IsNullOrWhiteSpace(NameParts(i))) Then
                Dim Result As Integer = _OrganizationKeywords.ListOfEntries.BinarySearch(NameParts(i))
                If (Result > -1) Then
                    Return True
                End If
            End If
        Next

        Return False

    End With

End Function

Upvotes: 1

user557597

Reputation:

I don't know of any regex engine that can do debugging except Perl.
So, as analogy I'm using Perl sample code to show how to shave considerable time
when doing a regex like this.

I'm sure you could translate this code into a vb.
It basically just factors out the first letter of each phrase and creates an
array of those phrases that are joined together. I use a hash to do this,
but could easily be done by sorting all the phrases, then looping through
each one for equal first letters.

First of all, having +1000 phrases would likely have most of the letters of the alphabet
as a starting character somewhere in the phrase, so a normal trie won't help much
in a flat regex.

Then, in the case of a flat regex, every phrase must be tested until a match is made.
That's +1000 tests per character in the source string. Quite a bit of overhead.

You can immediately divide that by 26 when you factor out the first letter of each phrase.
When you do that, a secondary trie class is opened up for each letter further reducing
the overhead a significant amount of factors.

If you do it for 2 characters, it falls off to an almost negligible amount of overhead.

Below just shows a debug of a FLAT 1 level (trie) regex,
and one of a single character level factor.

To analyze the regex, follow the paths in each TRIEC-EXACTF[..] represents a termination
point (pass or fail).

You can see that the paths are significantly reduced.

Perl code:

use strict;
use warnings;
use Data::Dumper;

use re 'debug';

my @Flat_Rx_ary = ();
my @rx_ary = ();


my %LetterHash = ();

while (my $line = <DATA>)
{
    chomp( $line );
    next if ( length($line) == 0);

    push ( @Flat_Rx_ary, $line );

    my $first_char = substr( $line, 0, 1);
    my $remainder = substr( $line, 1 );
    if ( !defined( $LetterHash{ $first_char } )) {
        $LetterHash{ $first_char } = [];
    }
    push ( @{$LetterHash{ $first_char }}, $remainder );
}

print Dumper(\%LetterHash);

# Factored regex ..
my @rx_parts = ();
foreach my $rx_key ( keys %LetterHash )
{
    @{$LetterHash{ $rx_key }} = sort @{$LetterHash{ $rx_key }};
    my $rx_val = join ( '|', @{$LetterHash{ $rx_key }} );
    push ( @rx_parts, '(?:' . $rx_key . '(?:' . $rx_val . '))' );
}
my $total_rx = '(?i)\b(' . join( '|', @rx_parts ) . ')\b'; 
print $total_rx, "\n\n\n";
my $CompiledRx = qr /$total_rx/;

# Flat regex ..
@Flat_Rx_ary = sort ( @Flat_Rx_ary );
my $Flat_Total_Rx = '(?i)\b(' . join( '|',@Flat_Rx_ary) . ')\b'; 
print "\n\n\n", $Flat_Total_Rx, "\n\n\n";
my $CompiledFlatRx = qr /$Flat_Total_Rx/;

__DATA__
hello world
this is cool
good day
one day beyond
a very fine time
the end of the season
the trial of the centurn
total eclipse
game on
hello LA

Output:

$VAR1 = {
          'a' => [
                   ' very fine time'
                 ],
          'h' => [
                   'ello world',
                   'ello LA'
                 ],
          'g' => [
                   'ood day',
                   'ame on'
                 ],
          'o' => [
                   'ne day beyond'
                 ],
          't' => [
                   'his is cool',
                   'he end of the season',
                   'he trial of the centurn',
                   'otal eclipse'
                 ]
        };
(?i)\b((?:a(?: very fine time))|(?:h(?:ello LA|ello world))|(?:g(?:ame on|ood da
y))|(?:o(?:ne day beyond))|(?:t(?:he end of the season|he trial of the centurn|h
is is cool|otal eclipse)))\b


Compiling REx "(?i)\b((?:a(?: very fine time))|(?:h(?:ello LA|ello world))|"...
Final program:
   1: BOUND (2)
   2: OPEN1 (4)
   4:   TRIEC-EXACTF[AGHOTaghot] (74)
        <a very fine time> (74)
        <h> (15)
  15:     EXACTF <ello > (18)
  18:     TRIE-EXACTF[LWlw] (74)
          <LA>
          <world>
        <g> (28)
  28:     TRIE-EXACTF[AOao] (74)
          <ame on>
          <ood day>
        <one day beyond> (74)
        <t> (48)
  48:     TRIEC-EXACTF[HOho] (74)
          <he end of the season>
          <he trial of the centurn>
          <his is cool>
          <otal eclipse>
  74: CLOSE1 (76)
  76: BOUND (77)
  77: END (0)
stclass BOUND minlen 7



(?i)\b(a very fine time|game on|good day|hello LA|hello world|one day beyond|the
 end of the season|the trial of the centurn|this is cool|total eclipse)\b


Compiling REx "(?i)\b(a very fine time|game on|good day|hello LA|hello worl"...
Final program:
   1: BOUND (2)
   2: OPEN1 (4)
   4:   TRIEC-EXACTF[AGHOTaghot] (60)
        <a very fine time>
        <game on>
        <good day>
        <hello LA>
        <hello world>
        <one day beyond>
        <the end of the season>
        <the trial of the centurn>
        <this is cool>
        <total eclipse>
  60: CLOSE1 (62)
  62: BOUND (63)
  63: END (0)
stclass BOUND minlen 7
Freeing REx: "(?i)\b((?:a(?: very fine time))|(?:h(?:ello LA|ello world))|"...
Freeing REx: "(?i)\b(a very fine time|game on|good day|hello LA|hello worl"...

Upvotes: 0

paparazzo

Reputation: 45096

Several issues

Create a List and then string join it is inefficient
Do it in one loop with StringBuilder
Why (?i) multiple times you can set it once in the pattern of even on the Regex
Why build up a whole list if it could be the first one
Yes I would test for str.Contains first as it is way faster
If it is not in the string move on

Upvotes: 0

Optimizing RegEx Pattern Built Using List of 1000 Phrases

Answers (3)

Related Questions