Reputation: 7680
I'm working on an app where we need to pull out key information from text. The catch is the text comes from OCRed documents, so there can be OCR recognition errors, noise, garbage characters, etc. Also, the text on the document can be in a million different formats depending on the source.
So, we use lots of regular expressions to pull out the text. We noticed that in high volume, this hammers the CPUs on the servers. I've tried pre-compiling the regexes and caching them without any improvement. Profiler shows 65% of the runtime is due to calling Regex.Match().
Reading up on regex, I see catastrophic backtracking is a performance issue.
Let's say I have an expression like this (this is a simple one just to illustrate the general format of our regexes -- others can contain many more keywords and formats):
(.*) KEYWORD1 AND (.* KEYWORD2)
When I step through with Regex Coach, I see it does a lot of backtracking to match the string.
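To make that concrete, our extraction code boils down to something like this sketch (simplified; the pattern and names are placeholders for the real ones, which have many more keywords and formats):

    using System.Text.RegularExpressions;

    class Extractor
    {
        // Pre-compiled and cached in a static field; the pattern is a stand-in
        // for our real, much longer ones.
        static readonly Regex KeywordPattern = new Regex(
            @"(.*) KEYWORD1 AND (.* KEYWORD2)",
            RegexOptions.Compiled);

        public static string ExtractField(string ocrBlob)
        {
            // The profiler points at this call as the hot spot.
            Match m = KeywordPattern.Match(ocrBlob);
            return m.Success ? m.Groups[2].Value : null;
        }
    }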
Can this type of regex be conceptually improved? We do only run against a subset of the entire document (a smaller blob of text), but the preprocessing to pull out the blob isn't perfect either by nature.
So, yeah, pretty much anything can appear before "KEYWORD1" and anything can appear in front of "KEYWORD2", etc.
We can't restrict to A-Z instead of .*, since in the OCR world, letters can sometimes be mistaken for numbers (e.g. Illene = I11ene), or we can get garbage characters in there due to OCR recognition errors.
Upvotes: 2
Views: 85
Reputation: 179779
Yes, these types of regexes can easily be optimized.
You optimize them by replacing the regex with the intended code. That is to say, two substring searches. If the position of " KEYWORD1 AND " is smaller than that of "KEYWORD2", then you've got a match.
For extra speed, you can use optimized substring searches, but that's almost certainly not needed. Just eliminating the regex will give a massive speed boost.
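A minimal sketch of that replacement in C#, assuming the same two-keyword shape as the example regex (the marker strings are placeholders):

    using System;

    class KeywordSearch
    {
        const string Marker1 = " KEYWORD1 AND ";
        const string Marker2 = "KEYWORD2";

        // Two IndexOf calls instead of (.*) KEYWORD1 AND (.* KEYWORD2).
        public static bool TryExtract(string input, out string before, out string between)
        {
            before = between = null;

            int k1 = input.IndexOf(Marker1, StringComparison.Ordinal);
            if (k1 < 0) return false;

            int afterMarker1 = k1 + Marker1.Length;
            int k2 = input.IndexOf(Marker2, afterMarker1, StringComparison.Ordinal);
            if (k2 < 0) return false;

            before  = input.Substring(0, k1);                                            // roughly group 1
            between = input.Substring(afterMarker1, k2 + Marker2.Length - afterMarker1); // roughly group 2
            return true;
        }
    }

One difference from the greedy regex: this takes the first occurrences rather than the last. If you need the regex's exact greedy behaviour, use LastIndexOf instead.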
[edit]
Ok, so there are 400. And some of them are slightly more complicated. The pattern remains the same: substantial substrings with little variation, that can be effectively located. If you know that "PART OF" occurs in your input, checking whether "AS PART OF" occurs can be done in approximately one nanosecond. And if "PART OF" _doesn't_ occur, you don't need to check at all whether "AS PART OF" occurs.
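Sketched in C# (the pattern and names are made up for illustration), the cheap probe gates the expensive regex:

    using System;
    using System.Text.RegularExpressions;

    class GatedMatch
    {
        // Hypothetical example pattern; the point is only the gating idea.
        static readonly Regex AsPartOfPattern = new Regex(@".*AS PART OF.*", RegexOptions.Compiled);

        public static Match MatchIfPlausible(string input)
        {
            // Cheap probe first: if "PART OF" isn't in the input, nothing that
            // contains it (such as "AS PART OF") can possibly match, so skip the regex.
            if (input.IndexOf("PART OF", StringComparison.Ordinal) < 0)
                return Match.Empty;

            return AsPartOfPattern.Match(input);
        }
    }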
Now 400 regexes is not much. If you had 40,000, it would be worthwhile to automate checking for common substrings. At the moment, you might just run each regex in turn, trying to match the other 399 regex strings to get a first cut: .*PART OF.* will match ".*AS PART OF.*".
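A rough sketch of that first cut (it assumes the 400 patterns are available as plain strings; the helper name is made up):

    using System;
    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    class FirstCut
    {
        // For each pattern, record which other patterns' source strings it matches.
        // If the broader pattern finds nothing in the input, the subsumed ones
        // can be skipped without running them.
        public static Dictionary<string, List<string>> FindSubsumed(IList<string> patterns)
        {
            var subsumed = new Dictionary<string, List<string>>();
            foreach (string a in patterns)
            {
                var regexA = new Regex(a);
                var covered = new List<string>();
                foreach (string b in patterns)
                {
                    // e.g. ".*PART OF.*" matches the string ".*AS PART OF.*"
                    if (!string.Equals(a, b, StringComparison.Ordinal) && regexA.IsMatch(b))
                        covered.Add(b);
                }
                if (covered.Count > 0)
                    subsumed[a] = covered;
            }
            return subsumed;
        }
    }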
For the same reason, you don't need further optimizations yet either. With 40,000 regexes to match, I'd calculate the frequency of each letter pair. E.g. the input "FOO AS PART OF BAR" has the letter pairs FO, OO, AS, PA, AR (twice), RT, OF, BA. This input cannot match .*FOR EXAMPLE.*, as the letter pair EX is missing.
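A presence-only version of that filter, sketched in C# (the "required literal" per regex is an assumption; you'd still need to extract the literal part of each real pattern first):

    using System;
    using System.Collections.Generic;

    class LetterPairFilter
    {
        // Collect all adjacent letter pairs that occur inside words of the text.
        public static HashSet<string> LetterPairs(string text)
        {
            var pairs = new HashSet<string>(StringComparer.Ordinal);
            for (int i = 0; i + 1 < text.Length; i++)
            {
                if (char.IsLetter(text[i]) && char.IsLetter(text[i + 1]))
                    pairs.Add(text.Substring(i, 2));
            }
            return pairs;
        }

        // False means the regex built around requiredLiteral can be skipped:
        // the input lacks at least one letter pair the literal needs.
        public static bool CouldMatch(HashSet<string> inputPairs, string requiredLiteral)
        {
            foreach (string pair in LetterPairs(requiredLiteral))
            {
                if (!inputPairs.Contains(pair))
                    return false;   // e.g. "EX" from "FOR EXAMPLE" is missing
            }
            return true;
        }
    }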
Upvotes: 3