regex: performance of lookbehind while still capturing data?

Question

In my project in C# I am parsing text for dates. The dates can be in various formats, objective is to find and correct a number of date format errors. Various date formats means a set of defined date formats isn't feasible. Originally I had a set of around 10 regexes applied one by one to the input string. This was functionally fine but when the string got towards 200 KB of text, performance became a problem as the function took about 150 ms.

I found I could improve performance considerably by applying the date regexes only to substrings that were dates. So if all dates had to have the English month name, using a regex of

\b(January|February|March|April|May|June|July|August|September|October|November|December)\b

would find them. If I then did some substringing to get the text around the month matched, overall function performance was about 25 ms, so much better. However, the substring/loop, length check code is untidy and doesn't feel like a really good solution. What I really wanted was a single regex to match the month and text around it, something like

.{0,25}\b(January|February|March|April|May|June|July|August|September|October|November|December)\b.{0,25}

is functionally fine. However, performance of this regex is about 3500 ms to find matches on the same long input string.

Now the similar regex

(?<=.{0,25})\b(January|February|March|April|May|June|July|August|September|October|November|December)\b.{0,25}

with a positive lookbehind finds the matches in about 15 ms (due very reduced backtracking, reasons I accept and have some understanding of). However, that doesn't work for my use as I need the text before and after the month name to be included in the match result.

So, my question is, can I have a regex that has the performance of using the lookbehind, but the functionality of providing all the text within the match result?

regex: performance of lookbehind while still capturing data?

Answers (1)

Related Questions