dotslashlu

Reputation: 3401

Why does this pattern only match the first and the last?

haystack:

<h2 >a&nbsp; &middot;&nbsp;&middot;&nbsp;&middot;
</h2>
<div class="indent">
aaaa
</div>
<h2 >b&nbsp; &middot;&nbsp;&middot;&nbsp;&middot;
</h2>
<div class="indent">
bbbb
</div>

pattern I used:

#<h2[^>]*>(a|b)(?!</h2>)[\s\S]*</h2><div class="indent">((?!</div>)[\s\S]+)</div>#

This pattern only matches the content of the first h2 (e.g. a&nbsp; &middot;&nbsp;&middot;&nbsp;&middot;) and the content of the last div (e.g. bbbb).

But I want it to match the content of every h2 and div so that I get a one-to-one map (e.g. a&nbsp; &middot;&nbsp;&middot;&nbsp;&middot; => aaaa, b&nbsp; &middot;&nbsp;&middot;&nbsp;&middot; => bbbb). How do I do this?

Upvotes: 0

Views: 144

Answers (1)

Andrew Clark

Reputation: 208545

[\s\S]* and [\s\S]+ are greedy, meaning they will match as many characters as possible. Try changing them to [\s\S]*? and [\s\S]+?.

With your current regex, if you were to put your [\s\S]* into a capturing group you would see that it matches the following:

&nbsp; &middot;&nbsp;&middot;&nbsp;&middot;
</h2>
<div class="indent">
aaaa
</div>
<h2 >b&nbsp; &middot;&nbsp;&middot;&nbsp;&middot;

Adding the ? at the end makes the quantifier lazy: instead of matching as many characters as possible, it matches as few as possible, so it stops at the first </h2>, like you want. The same reasoning applies to the [\s\S]+ later in your regex.
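To see the difference in isolation, here is a minimal sketch. It uses Python's re module as a stand-in (an assumption, since the #...# delimiters suggest PHP) and a simplified, made-up haystack rather than your real one:

```python
import re

# Simplified stand-in for the haystack (hypothetical, for illustration only).
text = "<h2>a</h2>\n<div>aaaa</div>\n<h2>b</h2>\n<div>bbbb</div>"

# Greedy: [\s\S]* runs to the end of the input, then backtracks
# only as far as the LAST </h2>.
greedy = re.findall(r"<h2>([\s\S]*)</h2>", text)

# Lazy: [\s\S]*? grows one character at a time and stops at the
# FIRST </h2> after each <h2>.
lazy = re.findall(r"<h2>([\s\S]*?)</h2>", text)

print(greedy)  # one match spanning both headings
print(lazy)    # one match per heading
```

The greedy version yields a single match that swallows everything between the first <h2> and the last </h2>, while the lazy version yields one match per heading.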

It also looks like this should fail on your sample string, because you have </h2><div... in the middle of your regex, but in your sample text there is always a newline between the closing </h2> and the <div>. You should probably change that part of the regex to </h2>\s*<div... instead. End result:

#<h2[^>]*>(a|b)(?!</h2>)[\s\S]*?</h2>\s*<div class="indent">((?!</div>)[\s\S]+?)</div>#
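As a rough check, the corrected pattern can be run against the sample haystack. This sketch again assumes Python's re module instead of PHP and drops the # delimiters. Note that group 1 in the pattern above only captures the leading a or b. The second pattern, full_heading_pattern, is a hypothetical variant (not part of the regex above) that widens group 1 to the whole heading, since that is what the one-to-one map needs:

```python
import re

haystack = """<h2 >a&nbsp; &middot;&nbsp;&middot;&nbsp;&middot;
</h2>
<div class="indent">
aaaa
</div>
<h2 >b&nbsp; &middot;&nbsp;&middot;&nbsp;&middot;
</h2>
<div class="indent">
bbbb
</div>
"""

# The corrected pattern from the answer, without the # delimiters.
answer_pattern = re.compile(
    r'<h2[^>]*>(a|b)(?!</h2>)[\s\S]*?</h2>\s*'
    r'<div class="indent">((?!</div>)[\s\S]+?)</div>'
)
# Each match is (leading letter, raw div body including its newlines).
matches = answer_pattern.findall(haystack)

# Hypothetical variant: group 1 now covers the whole heading text.
full_heading_pattern = re.compile(
    r'<h2[^>]*>((?:a|b)[\s\S]*?)</h2>\s*'
    r'<div class="indent">([\s\S]+?)</div>'
)
# Strip the surrounding newlines to get a clean heading => body map.
heading_to_body = {
    heading.strip(): body.strip()
    for heading, body in full_heading_pattern.findall(haystack)
}

for heading, body in heading_to_body.items():
    print(heading, "=>", body)
```

Both patterns find two matches on the sample, one per h2/div pair.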

But don't parse HTML with regex!
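For comparison, here is a minimal sketch of the parser route, assuming Python and its standard-library html.parser module. The class name HeadingBodyParser and its attributes are made up for this example. It does not handle nested divs, and by default the parser decodes entities such as &nbsp; and &middot; into the characters they stand for:

```python
from html.parser import HTMLParser


class HeadingBodyParser(HTMLParser):
    """Pairs each <h2> with the next <div class="indent"> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.pairs = []        # collected (heading text, div text) tuples
        self._target = None    # tag currently being captured: "h2", "div" or None
        self._buffer = []      # text chunks seen inside the current target
        self._heading = None   # last heading still waiting for its div

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._target, self._buffer = "h2", []
        elif tag == "div" and ("class", "indent") in attrs:
            self._target, self._buffer = "div", []

    def handle_data(self, data):
        if self._target is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._target:
            return
        text = "".join(self._buffer).strip()
        if tag == "h2":
            self._heading = text
        elif self._heading is not None:
            self.pairs.append((self._heading, text))
            self._heading = None
        self._target = None


sample = """<h2 >a&nbsp; &middot;&nbsp;&middot;&nbsp;&middot;
</h2>
<div class="indent">
aaaa
</div>
<h2 >b&nbsp; &middot;&nbsp;&middot;&nbsp;&middot;
</h2>
<div class="indent">
bbbb
</div>
"""

parser = HeadingBodyParser()
parser.feed(sample)
parser.close()

for heading, body in parser.pairs:
    print(repr(heading), "=>", repr(body))
```

The parser tracks tag boundaries itself, so there is no greedy-versus-lazy question to get wrong.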

Upvotes: 1
