Zabalba
Zabalba

Reputation: 75

Optimizing RegEx Pattern Built Using List of 1000 Phrases

I'm a novice in using RegEx.

I have a list of company phrases (1000+) that I am converting into a regex pattern at runtime.

Here is how I build the pattern:

    ListOfEntries.Sort()

    For i As Integer = 0 To (ListOfEntries.Count - 1)
        ListOfRegExEntries.Add("(\b(?i)" & ListOfEntries(i) & "\b)")
    Next

    RegExPatternString = "(" & String.Join("|", ListOfRegExEntries) & ")"

    RegExPattern = New Regex(RegExPatternString)

The entries are in all uppercase.

The string being matched is full name field. I simply would like to know if string contains a company keyword.

Is there anything I can do to optimize the matching process? If anyone needs more information please ask away!

Upvotes: 0

Views: 84

Answers (3)

Zabalba
Zabalba

Reputation: 75

With some of the other answers/comments it seems RegEx is not the best choice. I decided to use this code instead

Private Function ContainsOrganizationKeywordTest2() As Boolean

    With Output

        Dim BuiltFullName As String = UCase(String.Join(Space, {.PrimaryFirstName, .PrimaryMiddleName, .PrimaryLastName}))
        Dim NameParts As List(Of String) = BuiltFullName.Split(Space).ToList
        NameParts.Sort()

        For i As Integer = 0 To (NameParts.Count - 1)
            If (Not String.IsNullOrWhiteSpace(NameParts(i))) Then
                Dim Result As Integer = _OrganizationKeywords.ListOfEntries.BinarySearch(NameParts(i))
                If (Result > -1) Then
                    Return True
                End If
            End If
        Next

        Return False

    End With

End Function

Upvotes: 1

user557597
user557597

Reputation:

I don't know of any regex engine that can do debugging except Perl.
So, as analogy I'm using Perl sample code to show how to shave considerable time
when doing a regex like this.

I'm sure you could translate this code into a vb.
It basically just factors out the first letter of each phrase and creates an
array of those phrases that are joined together. I use a hash to do this,
but could easily be done by sorting all the phrases, then looping through
each one for equal first letters.

First of all, having +1000 phrases would likely have most of the letters of the alphabet
as a starting character somewhere in the phrase, so a normal trie won't help much
in a flat regex.

Then, in the case of a flat regex, every phrase must be tested until a match is made.
That's +1000 tests per character in the source string. Quite a bit of overhead.

You can immediately divide that by 26 when you factor out the first letter of each phrase.
When you do that, a secondary trie class is opened up for each letter further reducing
the overhead a significant amount of factors.

If you do it for 2 characters, it falls off to an almost negligible amount of overhead.

Below just shows a debug of a FLAT 1 level (trie) regex,
and one of a single character level factor.

To analyze the regex, follow the paths in each TRIEC-EXACTF[..] represents a termination
point (pass or fail).

You can see that the paths are significantly reduced.

Perl code:

use strict;
use warnings;
use Data::Dumper;

use re 'debug';

my @Flat_Rx_ary = ();
my @rx_ary = ();


my %LetterHash = ();

while (my $line = <DATA>)
{
    chomp( $line );
    next if ( length($line) == 0);

    push ( @Flat_Rx_ary, $line );

    my $first_char = substr( $line, 0, 1);
    my $remainder = substr( $line, 1 );
    if ( !defined( $LetterHash{ $first_char } )) {
        $LetterHash{ $first_char } = [];
    }
    push ( @{$LetterHash{ $first_char }}, $remainder );
}

print Dumper(\%LetterHash);

# Factored regex ..
my @rx_parts = ();
foreach my $rx_key ( keys %LetterHash )
{
    @{$LetterHash{ $rx_key }} = sort @{$LetterHash{ $rx_key }};
    my $rx_val = join ( '|', @{$LetterHash{ $rx_key }} );
    push ( @rx_parts, '(?:' . $rx_key . '(?:' . $rx_val . '))' );
}
my $total_rx = '(?i)\b(' . join( '|', @rx_parts ) . ')\b'; 
print $total_rx, "\n\n\n";
my $CompiledRx = qr /$total_rx/;

# Flat regex ..
@Flat_Rx_ary = sort ( @Flat_Rx_ary );
my $Flat_Total_Rx = '(?i)\b(' . join( '|',@Flat_Rx_ary) . ')\b'; 
print "\n\n\n", $Flat_Total_Rx, "\n\n\n";
my $CompiledFlatRx = qr /$Flat_Total_Rx/;

__DATA__
hello world
this is cool
good day
one day beyond
a very fine time
the end of the season
the trial of the centurn
total eclipse
game on
hello LA

Output:

$VAR1 = {
          'a' => [
                   ' very fine time'
                 ],
          'h' => [
                   'ello world',
                   'ello LA'
                 ],
          'g' => [
                   'ood day',
                   'ame on'
                 ],
          'o' => [
                   'ne day beyond'
                 ],
          't' => [
                   'his is cool',
                   'he end of the season',
                   'he trial of the centurn',
                   'otal eclipse'
                 ]
        };
(?i)\b((?:a(?: very fine time))|(?:h(?:ello LA|ello world))|(?:g(?:ame on|ood da
y))|(?:o(?:ne day beyond))|(?:t(?:he end of the season|he trial of the centurn|h
is is cool|otal eclipse)))\b


Compiling REx "(?i)\b((?:a(?: very fine time))|(?:h(?:ello LA|ello world))|"...
Final program:
   1: BOUND (2)
   2: OPEN1 (4)
   4:   TRIEC-EXACTF[AGHOTaghot] (74)
        <a very fine time> (74)
        <h> (15)
  15:     EXACTF <ello > (18)
  18:     TRIE-EXACTF[LWlw] (74)
          <LA>
          <world>
        <g> (28)
  28:     TRIE-EXACTF[AOao] (74)
          <ame on>
          <ood day>
        <one day beyond> (74)
        <t> (48)
  48:     TRIEC-EXACTF[HOho] (74)
          <he end of the season>
          <he trial of the centurn>
          <his is cool>
          <otal eclipse>
  74: CLOSE1 (76)
  76: BOUND (77)
  77: END (0)
stclass BOUND minlen 7



(?i)\b(a very fine time|game on|good day|hello LA|hello world|one day beyond|the
 end of the season|the trial of the centurn|this is cool|total eclipse)\b


Compiling REx "(?i)\b(a very fine time|game on|good day|hello LA|hello worl"...
Final program:
   1: BOUND (2)
   2: OPEN1 (4)
   4:   TRIEC-EXACTF[AGHOTaghot] (60)
        <a very fine time>
        <game on>
        <good day>
        <hello LA>
        <hello world>
        <one day beyond>
        <the end of the season>
        <the trial of the centurn>
        <this is cool>
        <total eclipse>
  60: CLOSE1 (62)
  62: BOUND (63)
  63: END (0)
stclass BOUND minlen 7
Freeing REx: "(?i)\b((?:a(?: very fine time))|(?:h(?:ello LA|ello world))|"...
Freeing REx: "(?i)\b(a very fine time|game on|good day|hello LA|hello worl"...

Upvotes: 0

paparazzo
paparazzo

Reputation: 45096

Several issues

  • Create a List and then string join it is inefficient
    Do it in one loop with StringBuilder
  • Why (?i) multiple times you can set it once in the pattern of even on the Regex
  • Why build up a whole list if it could be the first one
  • Yes I would test for str.Contains first as it is way faster
    If it is not in the string move on

Upvotes: 0

Related Questions