Reputation: 249
In my project in C# I am parsing text for dates. The dates can be in various formats, objective is to find and correct a number of date format errors. Various date formats means a set of defined date formats isn't feasible. Originally I had a set of around 10 regexes applied one by one to the input string. This was functionally fine but when the string got towards 200 KB of text, performance became a problem as the function took about 150 ms.
I found I could improve performance considerably by applying the date regexes only to substrings that were dates. So if all dates had to have the English month name, using a regex of
\b(January|February|March|April|May|June|July|August|September|October|November|December)\b
would find them. If I then did some substringing to get the text around the month matched, overall function performance was about 25 ms, so much better. However, the substring/loop, length check code is untidy and doesn't feel like a really good solution. What I really wanted was a single regex to match the month and text around it, something like
.{0,25}\b(January|February|March|April|May|June|July|August|September|October|November|December)\b.{0,25}
is functionally fine. However, performance of this regex is about 3500 ms to find matches on the same long input string.
Now the similar regex
(?<=.{0,25})\b(January|February|March|April|May|June|July|August|September|October|November|December)\b.{0,25}
with a positive lookbehind finds the matches in about 15 ms (due very reduced backtracking, reasons I accept and have some understanding of). However, that doesn't work for my use as I need the text before and after the month name to be included in the match result.
So, my question is, can I have a regex that has the performance of using the lookbehind, but the functionality of providing all the text within the match result?
Upvotes: 2
Views: 489
Reputation: 11601
The performance gain is an illusion. Normally, something like .{0,25}
will cause a lot of backtracking, which explains the poor performance that you're seeing. When placed inside a look-behind, however, it stops behaving greedily and backtracking, the look-behind will look for the smallest possible match, which means that 0 characters will be tried, with no backtracking. This means that the look-behind is completely useless since it will always match zero characters before the month name.
What if you extract the context after you find a match on the month name by using the position of the match? For each match
in regex.Matches(str)
, get match.Index
and match.Length
and substring before and after those positions.
Upvotes: 2