Reputation: 75
I'm a novice in using RegEx.
I have a list of company phrases (1000+) that I am converting into a regex pattern at runtime.
Here is how I build the pattern:
ListOfEntries.Sort()
For i As Integer = 0 To (ListOfEntries.Count - 1)
ListOfRegExEntries.Add("(\b(?i)" & ListOfEntries(i) & "\b)")
Next
RegExPatternString = "(" & String.Join("|", ListOfRegExEntries) & ")"
RegExPattern = New Regex(RegExPatternString)
The entries are in all uppercase.
The string being matched is full name field. I simply would like to know if string contains a company keyword.
Is there anything I can do to optimize the matching process? If anyone needs more information please ask away!
Upvotes: 0
Views: 84
Reputation: 75
With some of the other answers/comments it seems RegEx is not the best choice. I decided to use this code instead
Private Function ContainsOrganizationKeywordTest2() As Boolean
With Output
Dim BuiltFullName As String = UCase(String.Join(Space, {.PrimaryFirstName, .PrimaryMiddleName, .PrimaryLastName}))
Dim NameParts As List(Of String) = BuiltFullName.Split(Space).ToList
NameParts.Sort()
For i As Integer = 0 To (NameParts.Count - 1)
If (Not String.IsNullOrWhiteSpace(NameParts(i))) Then
Dim Result As Integer = _OrganizationKeywords.ListOfEntries.BinarySearch(NameParts(i))
If (Result > -1) Then
Return True
End If
End If
Next
Return False
End With
End Function
Upvotes: 1
Reputation:
I don't know of any regex engine that can do debugging except Perl.
So, as analogy I'm using Perl sample code to show how to shave considerable time
when doing a regex like this.
I'm sure you could translate this code into a vb.
It basically just factors out the first letter of each phrase and creates an
array of those phrases that are joined together. I use a hash to do this,
but could easily be done by sorting all the phrases, then looping through
each one for equal first letters.
First of all, having +1000 phrases would likely have most of the letters of the alphabet
as a starting character somewhere in the phrase, so a normal trie won't help much
in a flat regex.
Then, in the case of a flat regex, every phrase must be tested until a match is made.
That's +1000 tests per character in the source string. Quite a bit of overhead.
You can immediately divide that by 26 when you factor out the first letter of each phrase.
When you do that, a secondary trie class is opened up for each letter further reducing
the overhead a significant amount of factors.
If you do it for 2 characters, it falls off to an almost negligible amount of overhead.
Below just shows a debug of a FLAT 1 level (trie) regex,
and one of a single character level factor.
To analyze the regex, follow the paths in each TRIEC-EXACTF[..]
represents a termination
point (pass or fail).
You can see that the paths are significantly reduced.
Perl code:
use strict;
use warnings;
use Data::Dumper;
use re 'debug';
my @Flat_Rx_ary = ();
my @rx_ary = ();
my %LetterHash = ();
while (my $line = <DATA>)
{
chomp( $line );
next if ( length($line) == 0);
push ( @Flat_Rx_ary, $line );
my $first_char = substr( $line, 0, 1);
my $remainder = substr( $line, 1 );
if ( !defined( $LetterHash{ $first_char } )) {
$LetterHash{ $first_char } = [];
}
push ( @{$LetterHash{ $first_char }}, $remainder );
}
print Dumper(\%LetterHash);
# Factored regex ..
my @rx_parts = ();
foreach my $rx_key ( keys %LetterHash )
{
@{$LetterHash{ $rx_key }} = sort @{$LetterHash{ $rx_key }};
my $rx_val = join ( '|', @{$LetterHash{ $rx_key }} );
push ( @rx_parts, '(?:' . $rx_key . '(?:' . $rx_val . '))' );
}
my $total_rx = '(?i)\b(' . join( '|', @rx_parts ) . ')\b';
print $total_rx, "\n\n\n";
my $CompiledRx = qr /$total_rx/;
# Flat regex ..
@Flat_Rx_ary = sort ( @Flat_Rx_ary );
my $Flat_Total_Rx = '(?i)\b(' . join( '|',@Flat_Rx_ary) . ')\b';
print "\n\n\n", $Flat_Total_Rx, "\n\n\n";
my $CompiledFlatRx = qr /$Flat_Total_Rx/;
__DATA__
hello world
this is cool
good day
one day beyond
a very fine time
the end of the season
the trial of the centurn
total eclipse
game on
hello LA
Output:
$VAR1 = {
'a' => [
' very fine time'
],
'h' => [
'ello world',
'ello LA'
],
'g' => [
'ood day',
'ame on'
],
'o' => [
'ne day beyond'
],
't' => [
'his is cool',
'he end of the season',
'he trial of the centurn',
'otal eclipse'
]
};
(?i)\b((?:a(?: very fine time))|(?:h(?:ello LA|ello world))|(?:g(?:ame on|ood da
y))|(?:o(?:ne day beyond))|(?:t(?:he end of the season|he trial of the centurn|h
is is cool|otal eclipse)))\b
Compiling REx "(?i)\b((?:a(?: very fine time))|(?:h(?:ello LA|ello world))|"...
Final program:
1: BOUND (2)
2: OPEN1 (4)
4: TRIEC-EXACTF[AGHOTaghot] (74)
<a very fine time> (74)
<h> (15)
15: EXACTF <ello > (18)
18: TRIE-EXACTF[LWlw] (74)
<LA>
<world>
<g> (28)
28: TRIE-EXACTF[AOao] (74)
<ame on>
<ood day>
<one day beyond> (74)
<t> (48)
48: TRIEC-EXACTF[HOho] (74)
<he end of the season>
<he trial of the centurn>
<his is cool>
<otal eclipse>
74: CLOSE1 (76)
76: BOUND (77)
77: END (0)
stclass BOUND minlen 7
(?i)\b(a very fine time|game on|good day|hello LA|hello world|one day beyond|the
end of the season|the trial of the centurn|this is cool|total eclipse)\b
Compiling REx "(?i)\b(a very fine time|game on|good day|hello LA|hello worl"...
Final program:
1: BOUND (2)
2: OPEN1 (4)
4: TRIEC-EXACTF[AGHOTaghot] (60)
<a very fine time>
<game on>
<good day>
<hello LA>
<hello world>
<one day beyond>
<the end of the season>
<the trial of the centurn>
<this is cool>
<total eclipse>
60: CLOSE1 (62)
62: BOUND (63)
63: END (0)
stclass BOUND minlen 7
Freeing REx: "(?i)\b((?:a(?: very fine time))|(?:h(?:ello LA|ello world))|"...
Freeing REx: "(?i)\b(a very fine time|game on|good day|hello LA|hello worl"...
Upvotes: 0
Reputation: 45096
Several issues
Upvotes: 0