Escape catastrophic backtracking in HTML markup

Question

As the title says, my data set is markup, and it looks somewhat like this:

    page
     stackoverflow is good 
     i have suffered 7

My regex tries to match each of the node blocks separately, i.e. I can attempt to match combine or menu. Here is the whole regex in one shot; I break down its internals just below it.

/((\s+.*)+<\/div>(?:(?=(\s+



It dives into that markup and grabs the desired node block. That is all. As for the internals, here we go:

/
(
  // match text that begins with these literals
  (
   \s+.*
  )+ /* match any white space or character after previous. But the problem is that this matches up till the closing tag of other DIVs i.e greedy. */
  <\/div> // stop at the next closing DIV (this catches the last DIV)
  (?: // begin non-capturing group 
   (?=
    (
     \s+


I've indented it and added comments to help anyone willing to assist. I have also looked for a solution in blogs and the manual. They say the problem is caused by an expression having too many possible ways to match, and that it can be remedied by reducing the number of possible outcomes, e.g. using +? instead of *. But as hard as I've tried, I'm unable to apply any of that to my current problem.
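To make "too many possibilities" concrete, here is a rough sketch that counts how many ways the repeated group (\s+.*)+ can carve up a line of text when every iteration has to match \s+.* and the whole line is consumed. The helper name countSplits and the test input are made up for illustration. The count is only a lower bound on what a backtracking engine may have to try when the rest of the pattern fails to match.

```javascript
// Hypothetical helper: count the distinct ways (\s+.*)+ can split `text`
// into consecutive pieces that each match \s+.* (whole string consumed).
function countSplits(text) {
  const piece = /^\s+.*$/; // one iteration of the repeated group
  const memo = new Array(text.length + 1).fill(-1);

  function waysFrom(start) {
    if (start === text.length) return 1; // everything consumed: one valid split
    if (memo[start] !== -1) return memo[start];
    let total = 0;
    // Try every possible end position for the piece that begins at `start`.
    for (let end = start + 1; end <= text.length; end++) {
      if (piece.test(text.slice(start, end))) {
        total += waysFrom(end);
      }
    }
    memo[start] = total;
    return total;
  }

  return waysFrom(0);
}

// Each extra " a" doubles the number of splits.
console.log(countSplits(' a'.repeat(5)));  // 16
console.log(countSplits(' a'.repeat(10))); // 512
console.log(countSplits(' a'.repeat(20))); // 524288
```

The counting itself is memoized, so it stays fast. A backtracking regex engine has no such shortcut and may walk through the splits one by one, which is why the running time explodes as the input grows.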

Ry- · Accepted Answer

(\s+.*)+

can probably be simplified to just

[^]*?

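which should prevent catastrophic backtracking. In JavaScript, [^] matches any character, newlines included, and the lazy *? takes as few characters as it can before the rest of the pattern matches. Overall simplification: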

/[^]*?<\/div>/
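A quick sketch of how the lazy version behaves compared with a greedy one. The sample string is assumed markup, since the original sample did not survive, and the class names are only there to tell the two blocks apart.

```javascript
// Assumed sample markup: two sibling DIV blocks.
const sample =
  '<div class="menu">\n  <p>stackoverflow is good</p>\n</div>\n' +
  '<div class="combine">\n  <p>i have suffered 7</p>\n</div>';

const lazy = /[^]*?<\/div>/;  // stops at the first closing DIV
const greedy = /[^]*<\/div>/; // runs on to the last closing DIV

const firstBlock = sample.match(lazy)[0];
const everything = sample.match(greedy)[0];

console.log(firstBlock);            // only the "menu" block
console.log(everything === sample); // true: greedy swallowed both blocks
```

Note that the lazy pattern stops at the first closing DIV it meets, so a block containing nested DIVs would be cut short.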

Have you considered using an HTML parser instead, though?

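// Parse the markup into a document instead of pattern-matching it.
// `data` is the HTML string from the question; DOMParser is a browser API.
const parser = new DOMParser();
const doc = parser.parseFromString(data, 'text/html');
// Take the first element with class "menu".
const menu = doc.getElementsByClassName('menu')[0];

console.log(menu.innerHTML);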
