Why does this pattern only match the first and the last?

Question

haystack:

<h2>a  · · ·</h2>
<div>aaaa</div>
<h2>b  · · ·</h2>
<div>bbbb</div>

pattern I used:

#<h2[^>]*>(a|b)(?!</h2>)[\s\S]*</h2><div>((?!</div>)[\s\S]+)</div>#

This pattern only matches the content of the first h2 (e.g. a · · ·) and the content of the last div (e.g. bbbb).

But I want it to match the content of every h2 and div so I can build a one-to-one map (e.g. a · · · => aaaa, b · · · => bbbb). How do I do this?

Andrew Clark · Accepted Answer

[\s\S]* and [\s\S]+ are greedy, meaning they will match as many characters as possible. Try changing them to [\s\S]*? and [\s\S]+?.

With your current regex, if you were to put your [\s\S]* into a capturing group you would see that it matches the following:

  · · ·</h2>
<div>aaaa</div>
<h2>b  · · ·

Adding the ? at the end makes the quantifier lazy: instead of matching as much as possible, it matches as few characters as possible, so it stops at the first </h2> like you want. The same reasoning applies to the [\s\S]+ later in your regex.
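To see the difference in isolation, here is a minimal sketch in Python, whose re module treats these quantifiers the same way. The <h2>/<div> markup and the "a ..." text are assumed stand-ins for the haystack in the question, and the pattern is trimmed down to the one quantifier being compared.

```python
import re

# Assumed sample input: h2/div pairs modelled on the question's haystack.
haystack = "<h2>a ...</h2>\n<div>aaaa</div>\n<h2>b ...</h2>\n<div>bbbb</div>"

# Greedy: [\s\S]* grabs everything, then backtracks to the LAST </h2>.
greedy = re.search(r"<h2>([\s\S]*)</h2>", haystack)

# Lazy: [\s\S]*? grows one character at a time and stops at the FIRST </h2>.
lazy = re.search(r"<h2>([\s\S]*?)</h2>", haystack)

print(repr(greedy.group(1)))  # 'a ...</h2>\n<div>aaaa</div>\n<h2>b ...'
print(repr(lazy.group(1)))    # 'a ...'
```

The greedy capture runs across the first div and into the second heading, which is why the original pattern pairs the first h2 with the last div.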

It also looks like this should fail on your sample string: you have </h2><div> in the middle of your regex, but in your sample text there is always a newline between the closing </h2> and the <div>. You should probably change this section to </h2>\s*<div>. End result:

#<h2[^>]*>(a|b)(?!</h2>)[\s\S]*?</h2>\s*<div>((?!</div>)[\s\S]+?)</div>#
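As a rough check of the lazy version, the sketch below runs it with Python's re.findall. The sample markup is an assumption based on the h2 and div elements the question describes, and the # delimiters are dropped because Python does not use them.

```python
import re

# Assumed sample input: h2/div pairs separated by newlines.
haystack = "<h2>a ...</h2>\n<div>aaaa</div>\n<h2>b ...</h2>\n<div>bbbb</div>"

# Lazy quantifiers plus \s* between the closing </h2> and the <div>.
pattern = re.compile(
    r"<h2[^>]*>(a|b)(?!</h2>)[\s\S]*?</h2>\s*<div>((?!</div>)[\s\S]+?)</div>"
)

# findall returns one (group 1, group 2) tuple per non-overlapping match.
pairs = pattern.findall(haystack)
mapping = dict(pairs)

print(pairs)    # [('a', 'aaaa'), ('b', 'bbbb')]
print(mapping)  # {'a': 'aaaa', 'b': 'bbbb'}
```

Note that group 1 holds only the leading a or b, exactly as the pattern is written. To map the whole heading text, the first group would have to be widened to include the lazy [\s\S]*? part.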


But don't parse HTML with regex!
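For comparison, here is a minimal sketch of the parser route using Python's standard html.parser module. HeadingDivPairs is a hypothetical helper written for this example, the sample markup is assumed, and it only handles flat, non-nested h2/div pairs.

```python
from html.parser import HTMLParser


class HeadingDivPairs(HTMLParser):
    """Collect (h2 text, following div text) pairs. Hypothetical helper."""

    def __init__(self):
        super().__init__()
        self.pairs = []       # finished (heading, body) tuples
        self._tag = None      # tag whose text is currently being read
        self._buf = []        # text pieces collected for that tag
        self._heading = None  # last h2 text still waiting for its div

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "div"):
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = "".join(self._buf).strip()
        if tag == "h2":
            self._heading = text
        elif self._heading is not None:
            self.pairs.append((self._heading, text))
            self._heading = None
        self._tag = None


parser = HeadingDivPairs()
parser.feed("<h2>a ...</h2>\n<div>aaaa</div>\n<h2>b ...</h2>\n<div>bbbb</div>")
parser.close()
print(parser.pairs)  # [('a ...', 'aaaa'), ('b ...', 'bbbb')]
```

Unlike the regex, this does not care about whitespace between the tags or attributes on them.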
