Reputation: 153
I am trying to capture everything between two strings. The problem is that the text I want to capture can be as long as 3000 lines of numbers and commas, and when that happens I get a catastrophic backtracking error.
This is the regex I am using, with sample data below:
NEM12[\s\S]+?<\/CSVIntervalData>
<.CSVIntervalData>100,NEM12,201807290900,WBAYM,EEQ 200,3030910307,B1E1K1Q1,03,B1,N1,91111580,kWh,30, 300,20180728,.278,.278,.278,.278,.278,.278,.278,.278,.278,.278,.278,.278,.278,.056,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,.074,.278,.278,.278,.278,.278,.278,.278,.278,.278,.278,.278,.278,.278,E75,,,20180729000320, 900 <./CSVIntervalData>
Note that there can be a thousand lines of numbers, dots, and commas in between.
Upvotes: 3
Views: 712
Reputation: 627292
Your regex is based on a lazy matching pattern, which implies a lot of overhead for a regex engine if the string you need to match is very long. When the NEM12
is matched, the </CSVIntervalData>
is tried, and once the engine does not find it, it expands [\s\S]*?
pattern, matches any char, and again re-tests the </CSVIntervalData>
pattern, and so on. After enough of these iterations, you hit the problem you are seeing (at regex101, this usually shows up as a time-out rather than catastrophic backtracking, since the lazy pattern does not backtrack here; backtracking is triggered only with greedy patterns).
What you can do is unroll the lazy pattern:
NEM12[^<]*(?:<(?!/CSVIntervalData>)[^<]*)*</CSVIntervalData>
See the regex demo (note the difference of 317 vs. 46 steps).
The [\s\S]*?
is replaced with [^<]*(?:<(?!/CSVIntervalData>)[^<]*)*
: 0+ chars other than <
, then any 0+ sequences of <
not followed with /CSVIntervalData>
followed with any 0+ chars other than <
. Although it is longer, it matches text in chunks, and is faster and more reliable when the expected matches are long. It will not be as fast if your text contains many consecutive <
chars in between the delimiters, but that is usually not the case with real data.
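To make the equivalence concrete, here is a minimal Python sketch that runs both the lazy pattern and the unrolled one against a shortened sample. The sample text is an assumption: it is modelled on the data in the question, trimmed, and written with plain `<CSVIntervalData>` tags so that the closing delimiter in the regex can match. It only checks that both patterns find the same text; it does not measure speed.

```python
import re

# Shortened, illustrative sample modelled on the question's data (an assumption,
# not the real file). Tags are written as plain <CSVIntervalData> ... </CSVIntervalData>.
sample = (
    "<CSVIntervalData>100,NEM12,201807290900,WBAYM,EEQ\n"
    "200,3030910307,B1E1K1Q1,03,B1,N1,91111580,kWh,30,\n"
    "300,20180728,.278,.278,.056,0,0,.074,.278,E75,,,20180729000320,\n"
    "900\n"
    "</CSVIntervalData>"
)

# Lazy version: expands one character at a time, re-testing the delimiter each time.
lazy = re.compile(r"NEM12[\s\S]*?</CSVIntervalData>")

# Unrolled version: consumes runs of non-"<" characters in one go, and only
# stops to check the lookahead when it actually meets a "<".
unrolled = re.compile(r"NEM12[^<]*(?:<(?!/CSVIntervalData>)[^<]*)*</CSVIntervalData>")

lazy_match = lazy.search(sample)
unrolled_match = unrolled.search(sample)

# Both patterns should select exactly the same text.
same_match = lazy_match.group(0) == unrolled_match.group(0)
print(same_match)

# A stray "<" inside the data is handled by the (?:<(?!...)[^<]*)* part.
with_stray = sample.replace(".056", "<.056")
stray_match = unrolled.search(with_stray)
print(stray_match is not None)
```

Note that `[^<]` also matches line breaks, so no extra flag is needed for multi-line data.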
If you need to capture what is between these two strings, NEM12
and </CSVIntervalData>
, do not forget the capturing group:
NEM12([^<]*(?:<(?!/CSVIntervalData>)[^<]*)*)</CSVIntervalData>
     ^                                     ^
See this regex demo.
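As a rough illustration of reading the captured text, the following Python sketch applies the pattern with the capturing group to a tiny made-up input and to a synthetic block of 3000 lines. Both inputs are assumptions invented for the example, not the asker's real data.

```python
import re

# Unrolled pattern with a capturing group around the part between the delimiters.
pattern = re.compile(
    r"NEM12([^<]*(?:<(?!/CSVIntervalData>)[^<]*)*)</CSVIntervalData>"
)

# Tiny illustrative input (made up for this example).
text = (
    "<CSVIntervalData>100,NEM12,2018,WBAYM\n"
    "300,20180728,.278,0,E75\n"
    "900\n"
    "</CSVIntervalData>"
)

m = pattern.search(text)
captured = m.group(1) if m else None  # group(1) is the text between the delimiters
print(repr(captured))

# Synthetic long input: 3000 lines of numbers, dots and commas between the delimiters.
body = "\n300,20180728,.278,.278,0,0" * 3000
big = "NEM12" + body + "\n</CSVIntervalData>"

big_match = pattern.search(big)
big_line_count = big_match.group(1).count("\n") if big_match else 0
print(big_line_count)
```

Group 0 still holds the whole match including both delimiters; group 1 holds only what lies between them.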
Upvotes: 6