Reputation: 46841

Confused about Grabbing HTML Tags regex pattern

I was reading regular-expressions.info examples to try to learn more regex patterns.

The first example, Grabbing HTML Tags, gives a regex for matching the opening and closing pair of a specific HTML tag.

<TAG\b[^>]*>(.*?)</TAG>

I'm a little confused here. Why is \b[^>]* added to the above regex pattern, when it seems the same thing can be achieved with the regex pattern below?

<TAG>(.*?)</TAG>

Why is this extra part of the pattern used? Does it help performance in any way?

Upvotes: 0

Answers (4)

Avinash Raj

Reputation: 174696

Because without the word boundary, the pattern matches any tag whose name merely starts with TAG, not only the TAG tags themselves.

You can see the difference by trying the pattern with and without \b in a regex tester.

<TAG\b[^>]*>(.*?)</TAG>

Explanation:

< Matches the < symbol.
TAG The tag name.
\b Matches at a word boundary, i.e. between a word character and a non-word character.
[^>]* Matches any character other than >, zero or more times.
(.*?) Captures the section between the opening and closing tags. The ? after the * makes the match reluctant (lazy).
</TAG> Matches the end tag.

For example:

Input:

<a href="www.foo.com">link</a>
<ahref="www.foo.com">link</a>

Regex:

<a[^>]*>(.*?)<\/a>

The above regex would match both lines, including the malformed second one.

Regex:

<a\b[^>]*>(.*?)<\/a>

But this would match only the first one, because a word boundary exists between a and the space that follows it. In the second line, a is followed by h, which is another word character, so \b fails there.
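To check this outside a regex tester, here is a minimal sketch using Python's re module on the same two input lines. The variable names are only illustrative, and other regex flavors may differ in small details.

```python
import re

# The two input lines from the example above.
html = '<a href="www.foo.com">link</a>\n<ahref="www.foo.com">link</a>'

# Same patterns as above; "/" needs no escaping in Python.
without_boundary = re.compile(r'<a[^>]*>(.*?)</a>')
with_boundary = re.compile(r'<a\b[^>]*>(.*?)</a>')

# Collect the full text of every match for each pattern.
loose_matches = [m.group(0) for m in without_boundary.finditer(html)]
strict_matches = [m.group(0) for m in with_boundary.finditer(html)]

print(loose_matches)   # both lines match
print(strict_matches)  # only the real <a ...> tag matches
```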

Upvotes: 0

aliteralmind

Reputation: 20163

The \b[^>]* in

<TAG\b[^>]*>(.*?)</TAG>

allows there to be text (such as attributes: width="30") and whitespace in the open tag, as long as it's only a TAG and not TAGX or some other type (that's what the \b word boundary is for). Syntax and spacing in HTML are very loosey-goosey. It's always safe to allow extra attributes and whitespace, as a single HTML tag can span many lines.

The latter regex

<TAG>(.*?)</TAG>

only allows the opening tag to be exactly <TAG>, followed by some text (which may span multiple lines if the "dot matches newline" option is on), followed by </TAG>.
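As a rough illustration of the difference between the two patterns, here is a short Python sketch. The div input is made up for this example, and re.DOTALL is Python's name for the "dot matches newline" option.

```python
import re

# Hypothetical input: one bare tag, and one tag whose attributes span two lines.
html = '<div>plain</div>\n<div class="box"\n     id="main">with attributes</div>'

# Opening tag must be exactly <div>.
bare = re.compile(r'<div>(.*?)</div>', re.DOTALL)
# Opening tag may carry attributes and whitespace; [^>]* also matches newlines.
flexible = re.compile(r'<div\b[^>]*>(.*?)</div>', re.DOTALL)

bare_contents = bare.findall(html)
flexible_contents = flexible.findall(html)

print(bare_contents)      # ['plain']
print(flexible_contents)  # ['plain', 'with attributes']
```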

The ? in .*? makes the quantifier reluctant, meaning the match stops at the first closing </TAG> after the opening tag. Eliminating the ? changes it to greedy, meaning that the match runs on to the last closing </TAG> in the search string.
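A small Python sketch of reluctant versus greedy matching, using a made-up string with two b elements:

```python
import re

# Hypothetical input containing two separate <b> elements.
text = '<b>one</b> and <b>two</b>'

# Reluctant: each match stops at the nearest closing tag.
lazy = re.findall(r'<b\b[^>]*>(.*?)</b>', text)
# Greedy: a single match runs to the last closing tag in the string.
greedy = re.findall(r'<b\b[^>]*>(.*)</b>', text)

print(lazy)    # ['one', 'two']
print(greedy)  # ['one</b> and <b>two']
```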

Be sure to check out the Stack Overflow Regular Expressions FAQ :)

Upvotes: 0

zx81

Reputation: 41838

That's in order to match things like <a href=...> stuff </a>, as opposed to a simple <b> stuff </b>, where your shorter pattern would work.
The \b boundary is needed in order to avoid matching things like <attribute ...> stuff </a>.
The lazy quantifier .*? between the opening and closing tags is needed, as opposed to [^<]*, because between the opening and closing tags you might have another tag (for instance <b>).
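To illustrate the last point, here is a minimal Python sketch comparing .*? with [^<]* on a link that contains a nested tag. The input string is invented for this example.

```python
import re

# Hypothetical input: a link whose text contains a nested <b> element.
html = '<a href="x">click <b>here</b> now</a>'

# .*? can cross the nested tag and still stop at the first </a>.
lazy_dot = re.search(r'<a\b[^>]*>(.*?)</a>', html)
# [^<]* stops at the "<" of <b>, so the following </a> cannot match.
no_angle = re.search(r'<a\b[^>]*>([^<]*)</a>', html)

print(lazy_dot.group(1))  # click <b>here</b> now
print(no_angle)           # None
```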

Upvotes: 1

McLovin

Reputation: 3674

Some opening tags have attributes, like <img src="asdf.png">. The tag does not end until the > is reached, so the word boundary marks the end of the tag name and the non-> characters ([^>]*) match the src="asdf.png" part.

Upvotes: 0
