user1926840

Regular Expression Matching More Than Expected

I have a large chunk of HTML.

With this:

~<div>(?:.*?)<a[\s]+[^>]*?href[\s]?=[\s"\']+(#_ftnref([0-9]+))["\']+.*?>(?:[^<]+|.*?)?</a>(.*?)</div>~si

I am capturing this:

<div> </div><hr align="left" size="1" width="33%" /><div><p><a title="" href="#_ftnref1">[1]</a> This is not to suggest that there are only two possible arguments to be made in support of  blah blah <em>blah</em>.</p></div>

But! I want this:

<div><p><a title="" href="#_ftnref1">[1]</a> This is not to suggest that there are only two possible arguments to be made in support of  blah blah <em>blah</em>.</p></div>

Can you help?

PS: (?: ), in contrast to ( ), is used to avoid capturing text. I'm doing that on purpose because I want the returned $matches array to be consistent across several different regexes not mentioned in this post.

Upvotes: 0

Views: 83

Answers (1)

mario

Reputation: 145502

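Lazy matching with .*? only makes the match end as early as possible; it does not stop the match from starting at an earlier <div>. If lazy matching doesn't work, you need to come up with an exclusion pattern: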

(?:(?!</div>).)*

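For instance, this matches any run of characters that does not contain </div>. Used in place of the first (?:.*?), it keeps the match inside a single div, because it cannot step over a closing </div>.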
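As a rough sketch of how the exclusion pattern could be dropped into the regex from the question, here is the idea in Python's re module. Python is an assumption made here for easy testing; the question appears to use PHP's preg functions, where the lookahead syntax is the same. The variable names and the shortened sample HTML are illustrative only.

```python
import re

# Shortened sample modelled on the HTML in the question.
html = (
    '<div> </div><hr align="left" size="1" width="33%" />'
    '<div><p><a title="" href="#_ftnref1">[1]</a> This is not to suggest '
    'blah <em>blah</em>.</p></div>'
)

# Tail shared by both patterns, copied from the question's regex.
TAIL = (
    r'''<a[\s]+[^>]*?href[\s]?=[\s"']+(#_ftnref([0-9]+))["']+.*?>'''
    r'''(?:[^<]+|.*?)?</a>(.*?)</div>'''
)

# Original: the lazy .*? can still run across the first </div>.
original = re.compile(r'<div>(?:.*?)' + TAIL, re.S | re.I)

# Variant: each consumed character must not be the start of </div>.
excluded = re.compile(r'<div>(?:(?!</div>).)*' + TAIL, re.S | re.I)

too_much = original.search(html).group(0)
wanted = excluded.search(html)

print(too_much[:12])         # <div> </div>
print(wanted.group(0)[:8])   # <div><p>
print(wanted.groups()[:2])   # ('#_ftnref1', '1')
```

The capture groups keep their numbering, since the replacement is still a non-capturing group.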

Alternatively, a length constraint could be a workaround, limiting how many characters may sit between <div> and the link:

(?:.{0,20})
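A minimal sketch of the length-constraint idea, again in Python's re as an assumption for easy testing. The pattern is deliberately simplified compared with the one in the question, and both input strings are hypothetical. It also shows why this is only a workaround: when the preceding div sits within the 20-character window, the match still starts too early.

```python
import re

# Hypothetical, simplified inputs; only the gap before the <a tag differs.
far = ('<div> </div><hr align="left" size="1" width="33%" />'
       '<div><p><a href="#_ftnref1">[1]</a> note</p></div>')
near = '<div> </div><div><p><a href="#_ftnref1">[1]</a> note</p></div>'

# At most 20 characters may sit between <div> and the <a tag.
bounded = re.compile(r'<div>(?:.{0,20})<a\s[^>]*>.*?</a>(.*?)</div>',
                     re.S | re.I)

far_match = bounded.search(far).group(0)
near_match = bounded.search(near).group(0)

print(far_match[:8])     # <div><p>
print(near_match[:12])   # <div> </div>
```

The limit of 20 is arbitrary and would need tuning to the actual markup, so the exclusion pattern is usually the more robust choice.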

Upvotes: 1
