Reputation: 7680
I'm working on an app where we need to pull out key information from text. The catch is the text comes from OCRed documents, so there can be OCR recognition errors, noise, garbage characters, etc. Also, the text on the document can be in a million different formats depending on the source.
So, we use lots of regular expressions to pull out the text. We noticed that in high volume, this hammers the CPUs on the servers. I've tried pre-compiling the regexes and caching them without any improvement. Profiler shows 65% of the runtime is due to calling Regex.Match().
Reading up on regex, I see catastrophic backtracking is a performance issue.
Let's say I have an expression like this (this is a simple one just to illustrate the general format of our regexes -- others can contain many more keywords and formats):
(.*) KEYWORD1 AND (.* KEYWORD2)
When I step through with Regex Coach, I see it does a lot of backtracking to match the string.
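To make that concrete, our extraction code boils down to something like this sketch (simplified; the pattern and names are placeholders for the real ones, which have many more keywords and formats):

    using System.Text.RegularExpressions;

    class Extractor
    {
        // Pre-compiled and cached in a static field; the pattern is a stand-in
        // for our real, much longer ones.
        static readonly Regex KeywordPattern = new Regex(
            @"(.*) KEYWORD1 AND (.* KEYWORD2)",
            RegexOptions.Compiled);

        public static string ExtractField(string ocrBlob)
        {
            // The profiler points at this call as the hot spot.
            Match m = KeywordPattern.Match(ocrBlob);
            return m.Success ? m.Groups[2].Value : null;
        }
    }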
Can this type of regex be conceptually improved? We do only run against a subset of the entire document (a smaller blob of text), but the preprocessing to pull out the blob isn't perfect either by nature.
So, yeah, pretty much anything can appear before "KEYWORD1" and anything can appear in front of "KEYWORD2", etc.
We can't restrict to A-Z instead of .*, since in the OCR world, letters can sometimes be mistaken for numbers (e.g. Illene = I11ene), or we can get garbage characters in there due to OCR recognition errors.
Upvotes: 2
Views: 85
Reputation: 179779
Yes, these types of regexes can easily be optimized.
You optimize them by replacing the regex with the intended code. That is to say, two substring searches. If the position of " KEYWORD1 AND " is smaller than that of "KEYWORD2", then you've got a match.
For extra speed, you can use optimized substring searches, but that's almost certainly not needed. Just eliminating the regex will give a massive speed boost.
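A minimal sketch of that replacement in C#, assuming the same two-keyword shape as the example regex (the marker strings are placeholders):

    using System;

    class KeywordSearch
    {
        const string Marker1 = " KEYWORD1 AND ";
        const string Marker2 = "KEYWORD2";

        // Two IndexOf calls instead of (.*) KEYWORD1 AND (.* KEYWORD2).
        public static bool TryExtract(string input, out string before, out string between)
        {
            before = between = null;

            int k1 = input.IndexOf(Marker1, StringComparison.Ordinal);
            if (k1 < 0) return false;

            int afterMarker1 = k1 + Marker1.Length;
            int k2 = input.IndexOf(Marker2, afterMarker1, StringComparison.Ordinal);
            if (k2 < 0) return false;

            before  = input.Substring(0, k1);                                            // roughly group 1
            between = input.Substring(afterMarker1, k2 + Marker2.Length - afterMarker1); // roughly group 2
            return true;
        }
    }

One difference from the greedy regex: this takes the first occurrences rather than the last. If you need the regex's exact greedy behaviour, use LastIndexOf instead.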
[edit]
Ok, so there are 400. And some of them are slightly more complicated. The pattern remains the same: substantial substrings with little variation, that can be effectively located. If you know that "PART OF" occurs in your input, checking whether "AS PART OF" occurs can be done in approximately one nanosecond. And if "PART OF" _doesn't_ occur, you don't need to check at all whether "AS PART OF" occurs.
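Sketched in C# (the pattern and names are made up for illustration), the cheap probe gates the expensive regex:

    using System;
    using System.Text.RegularExpressions;

    class GatedMatch
    {
        // Hypothetical example pattern; the point is only the gating idea.
        static readonly Regex AsPartOfPattern = new Regex(@".*AS PART OF.*", RegexOptions.Compiled);

        public static Match MatchIfPlausible(string input)
        {
            // Cheap probe first: if "PART OF" isn't in the input, nothing that
            // contains it (such as "AS PART OF") can possibly match, so skip the regex.
            if (input.IndexOf("PART OF", StringComparison.Ordinal) < 0)
                return Match.Empty;

            return AsPartOfPattern.Match(input);
        }
    }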
Now 400 regexes is not much. If you had 40,000, it would be worthwhile to automate checking for common substrings. At the moment, you might just run each regex in turn, trying to match the other 399 regex strings to get a first cut: .*PART OF.* will match ".*AS PART OF.*".
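A rough sketch of that first cut (it assumes the 400 patterns are available as plain strings; the helper name is made up):

    using System;
    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    class FirstCut
    {
        // For each pattern, record which other patterns' source strings it matches.
        // If the broader pattern finds nothing in the input, the subsumed ones
        // can be skipped without running them.
        public static Dictionary<string, List<string>> FindSubsumed(IList<string> patterns)
        {
            var subsumed = new Dictionary<string, List<string>>();
            foreach (string a in patterns)
            {
                var regexA = new Regex(a);
                var covered = new List<string>();
                foreach (string b in patterns)
                {
                    // e.g. ".*PART OF.*" matches the string ".*AS PART OF.*"
                    if (!string.Equals(a, b, StringComparison.Ordinal) && regexA.IsMatch(b))
                        covered.Add(b);
                }
                if (covered.Count > 0)
                    subsumed[a] = covered;
            }
            return subsumed;
        }
    }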
For the same reason, you don't need further optimizations yet either. With 40,000 regexes to match, I'd calculate the frequency of each letter pair. E.g. the input "FOO AS PART OF BAR" has the letter pairs FO, OO, AS, PA, AR (twice), RT, OF, BA. This input cannot match .*FOR EXAMPLE.*, as the letter pair EX is missing.
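A presence-only version of that filter, sketched in C# (the "required literal" per regex is an assumption; you'd still need to extract the literal part of each real pattern first):

    using System;
    using System.Collections.Generic;

    class LetterPairFilter
    {
        // Collect all adjacent letter pairs that occur inside words of the text.
        public static HashSet<string> LetterPairs(string text)
        {
            var pairs = new HashSet<string>(StringComparer.Ordinal);
            for (int i = 0; i + 1 < text.Length; i++)
            {
                if (char.IsLetter(text[i]) && char.IsLetter(text[i + 1]))
                    pairs.Add(text.Substring(i, 2));
            }
            return pairs;
        }

        // False means the regex built around requiredLiteral can be skipped:
        // the input lacks at least one letter pair the literal needs.
        public static bool CouldMatch(HashSet<string> inputPairs, string requiredLiteral)
        {
            foreach (string pair in LetterPairs(requiredLiteral))
            {
                if (!inputPairs.Contains(pair))
                    return false;   // e.g. "EX" from "FOR EXAMPLE" is missing
            }
            return true;
        }
    }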
Upvotes: 3