Reputation: 441
As the title says, my data set is markup, and it looks somewhat like this:
<!DOCTYPE html>
<html>
<head>
<title>page</title>
</head>
<body>
<main>
<div class="menu">
<img src=mmayboy.jpg>
<p> stackoverflow is good </p>
</div>
<div class="combine">
<p> i have suffered <span>7</span></p>
</div>
</main>
</body>
</html>
My regex tries to match each of these node blocks separately, i.e. I can attempt to match combine or menu. In one shot, this is what my regex looks like; I break down its internals just below it.
/(<div class="menu">(\s+.*)+<\/div>(?:(?=(\s+<div))))/
It attempts to dive into that markup and grab the desired node block. That is all. As for the internals, here we go:
/
(
<div class="menu"> // match text that begins with these literals
(
\s+.*
)+ /* match any white space or character after previous. But the problem is that this matches up till the closing tag of other DIVs i.e greedy. */
<\/div> // stop at the next closing DIV (this catches the last DIV)
(?: // begin non-capturing group
(?=
(
\s+<div
) /* I'm using the positive lookahead to make sure the previous match is followed by whitespace and a new DIV tag. This is where the catastrophic backtracking is raised. */
)
)
)
/
I've indented it and added comments to aid anyone willing to help. I have also looked for a solution in blogs and the manual. They say catastrophic backtracking is caused by an expression having too many possible ways to match, and that it can be remedied by reducing the number of possible outcomes, i.e. using +? instead of *. But as hard as I've tried, I'm unable to apply any of it to my current dilemma.
Upvotes: 0
Views: 145
Reputation: 224942
(\s+.*)+
can probably be simplified to just
[^]*?
which should prevent catastrophic backtracking. Overall simplification:
/<div class="menu">[^]*?<\/div>/
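As a quick check, here is a minimal sketch that runs the simplified pattern against the sample markup from the question. The variable names (data, menuPattern, menuBlock) are only illustrative, and the markup is trimmed to the main element for brevity.

```javascript
// Sample markup from the question, trimmed to the <main> element.
const data = [
  '<main>',
  '<div class="menu">',
  '<img src=mmayboy.jpg>',
  '<p> stackoverflow is good </p>',
  '</div>',
  '<div class="combine">',
  '<p> i have suffered <span>7</span></p>',
  '</div>',
  '</main>',
].join('\n');

// [^] is a JavaScript-specific way to match any character, newlines included.
// The lazy quantifier *? expands one character at a time and stops at the
// first </div>, so there is no nested quantifier left to backtrack through.
const menuPattern = /<div class="menu">[^]*?<\/div>/;

const match = data.match(menuPattern);
const menuBlock = match ? match[0] : null;

// Logs the menu block only, without the "combine" div that follows it.
console.log(menuBlock);
```

Note that because the match stops at the first closing tag, it would cut the block short if the menu div ever contained another div inside it.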
Have you considered using an HTML parser instead, though?
var parser = new DOMParser();
var doc = parser.parseFromString(data, 'text/html');
var menu = doc.getElementsByClassName('menu')[0];
console.log(menu.innerHTML);
Upvotes: 1