Reputation: 3401
haystack:
<h2 >a · · ·
</h2>
<div class="indent">
aaaa
</div>
<h2 >b · · ·
</h2>
<div class="indent">
bbbb
</div>
pattern I used:
#<h2[^>]*>(a|b)(?!</h2>)[\s\S]*</h2><div class="indent">((?!</div>)[\s\S]+)</div>#
This pattern only matches the first h2 content (e.g. a · · ·) and the content of the last div (e.g. bbbb), but I want it to match the content of every h2 and its div so that I get a one-to-one map (e.g. a · · · => aaaa, b · · · => bbbb). How do I do this?
Upvotes: 0
Views: 144
Reputation: 208545
[\s\S]* and [\s\S]+ are greedy, meaning they will match as many characters as possible. Try changing them to [\s\S]*? and [\s\S]+?.
With your current regex, if you were to put your [\s\S]* into a capturing group, you would see that it matches the following:
· · ·
</h2>
<div class="indent">
aaaa
</div>
<h2 >b · · ·
Adding the ? at the end makes the quantifier lazy: instead of matching as many characters as possible, it matches as few as possible, so it stops at the first </h2>, as you want. The same reasoning applies to the [\s\S]+ later in your regex.
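The difference between the greedy and lazy forms can be checked with a short sketch. This uses Python's re module rather than the delimiter-style syntax from the question, which is an assumption on my part; the pattern is cut down to just the h2 part, and the extra capturing group around [\s\S]* is added only for illustration.

```python
import re

# Sample text from the question.
haystack = """<h2 >a · · ·
</h2>
<div class="indent">
aaaa
</div>
<h2 >b · · ·
</h2>
<div class="indent">
bbbb
</div>
"""

# Greedy: [\s\S]* first runs to the end of the input,
# then backtracks until it finds the LAST </h2>.
greedy = re.search(r"<h2[^>]*>(a|b)([\s\S]*)</h2>", haystack)

# Lazy: [\s\S]*? grows one character at a time
# and stops at the FIRST </h2>.
lazy = re.search(r"<h2[^>]*>(a|b)([\s\S]*?)</h2>", haystack)

print(repr(greedy.group(2)))  # spans the first div and the second heading
print(repr(lazy.group(2)))    # only the rest of the first heading
```

The greedy group swallows everything up to the second heading, while the lazy group ends right before the first closing tag.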
It also looks like this should fail on your sample string: your regex has </h2><div... in the middle, but in your sample text there is always a newline between the closing </h2> and the <div>. You should probably change that part of the pattern to </h2>\s*<div... so that it allows the whitespace. End result:
#<h2[^>]*>(a|b)(?!</h2>)[\s\S]*?</h2>\s*<div class="indent">((?!</div>)[\s\S]+?)</div>#
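As a rough check, here is the end-result pattern applied to the sample text. The sketch assumes Python's re module, so the # delimiters are dropped; the names pattern, wide_pattern, mapping, and wide_mapping are made up for the example. Note that the first group, (a|b), captures only the single letter. The second pattern is my own variation, not part of the pattern above, that widens the first group to take in the whole heading text.

```python
import re

# Sample text from the question.
haystack = """<h2 >a · · ·
</h2>
<div class="indent">
aaaa
</div>
<h2 >b · · ·
</h2>
<div class="indent">
bbbb
</div>
"""

# The end-result pattern, without the # delimiters.
pattern = re.compile(
    r'<h2[^>]*>(a|b)(?!</h2>)[\s\S]*?</h2>\s*'
    r'<div class="indent">((?!</div>)[\s\S]+?)</div>'
)

# findall returns one (heading, body) tuple per h2/div pair.
mapping = {h: body.strip() for h, body in pattern.findall(haystack)}
print(mapping)

# Variation (an assumption, not from the answer): capture the full heading text.
wide_pattern = re.compile(
    r'<h2[^>]*>((?:a|b)[\s\S]*?)</h2>\s*'
    r'<div class="indent">([\s\S]+?)</div>'
)
wide_mapping = {h.strip(): body.strip() for h, body in wide_pattern.findall(haystack)}
print(wide_mapping)
```

The strip() calls remove the newlines that surround the captured text in the sample.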
But don't parse HTML with regex!
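For comparison, a parser-based version might look like the sketch below. It assumes Python's standard html.parser module; the class name HeadingMapParser and its attributes are hypothetical, and it only handles the flat h2/div layout from the sample, not nested divs.

```python
from html.parser import HTMLParser

# Sample text from the question.
haystack = """<h2 >a · · ·
</h2>
<div class="indent">
aaaa
</div>
<h2 >b · · ·
</h2>
<div class="indent">
bbbb
</div>
"""


class HeadingMapParser(HTMLParser):
    """Collects {h2 text: text of the following div.indent} pairs."""

    def __init__(self):
        super().__init__()
        self.mapping = {}
        self._target = None    # tag currently being collected: "h2" or "div"
        self._buffer = []      # text pieces seen inside that tag
        self._heading = None   # last h2 text still waiting for its div

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._target, self._buffer = "h2", []
        elif tag == "div" and ("class", "indent") in attrs and self._heading is not None:
            self._target, self._buffer = "div", []

    def handle_data(self, data):
        if self._target is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._target:
            return
        text = "".join(self._buffer).strip()
        if tag == "h2":
            self._heading = text
        else:
            self.mapping[self._heading] = text
            self._heading = None
        self._target = None


parser = HeadingMapParser()
parser.feed(haystack)
parser.close()
print(parser.mapping)
```

A parser copes with attribute order, extra whitespace, and other markup variations that would otherwise each need their own tweak to the regex.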
Upvotes: 1