Braj

Reputation: 46841

Confused about Grabbing HTML Tags regex pattern

I was reading regular-expressions.info examples to try to learn more regex patterns.

The first example, "Grabbing HTML Tags", gives a regex for matching the opening and closing pair of a specific HTML tag.

<TAG\b[^>]*>(.*?)</TAG>

I'm a little confused here. Why is \b[^>]* added to the above pattern, when it seems the same thing can be achieved with the pattern below?

<TAG>(.*?)</TAG>

Why is this extra part of the pattern used? Does it improve performance?

Upvotes: 0

Views: 691

Answers (4)

Avinash Raj

Reputation: 174696

Because without the word boundary, the pattern matches any tag whose name merely starts with TAG, not only the TAG tag itself.

Try the pattern with and without \b to see the difference.

<TAG\b[^>]*>(.*?)</TAG>

Explanation:

  • < Matches < symbol.
  • TAG Tag name
  • \b Matches between a word character and a non-word character.
  • [^>]* Matches any character except >, zero or more times.
  • (.*?) Captures the section between the opening and closing tags. The ? after the * makes the match reluctant (lazy).
  • </TAG> Matches the end tag.

For example:

Input:

<a href="www.foo.com">link</a>
<ahref="www.foo.com">link</a>

Regex:

<a[^>]*>(.*?)<\/a>

The above regex would match both links.

Regex:

<a\b[^>]*>(.*?)<\/a>

But this would match only the first one, because a word boundary exists between a and the space that follows it. In the second input, a is followed by h, which is a word character, so \b fails there.
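The comparison above can be checked with a short script. This is a minimal sketch using Python's re module (the choice of Python is an assumption; the patterns themselves are the ones from this answer, and the variable names are illustrative only).

```python
import re

# The two inputs from the example above, one per line.
html = '<a href="www.foo.com">link</a>\n<ahref="www.foo.com">link</a>'

# Same pattern, without and with the \b word boundary after the tag name.
without_boundary = re.compile(r'<a[^>]*>(.*?)</a>')
with_boundary = re.compile(r'<a\b[^>]*>(.*?)</a>')

matches_without = [m.group(0) for m in without_boundary.finditer(html)]
matches_with = [m.group(0) for m in with_boundary.finditer(html)]

print(len(matches_without))  # 2
print(matches_with)          # ['<a href="www.foo.com">link</a>']
```

Without \b, [^>]* happily swallows the rest of the malformed name ahref; with \b, the engine requires the tag name to end right after a.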

Upvotes: 0

aliteralmind

Reputation: 20163

The \b[^>]* in

<TAG\b[^>]*>(.*?)</TAG>

allows text (such as attributes: width="30") and whitespace inside the opening tag, as long as the tag name is exactly TAG and not TAGX or some other name; that's what the \b word boundary is for. Syntax and spacing in HTML are very loosey-goosey. It's always safer to allow extra attributes and whitespace, as a single HTML tag can span many lines.

The latter regex

<TAG>(.*?)</TAG>

only allows the opening tag to be exactly <TAG>, followed by some text, then </TAG>. (The text can span multiple lines only when the dot is set to match newlines.)

The ? in .*? makes the quantifier reluctant (lazy), meaning the match ends at the first closing </TAG> that follows the opening tag. Eliminating the ? makes it greedy, meaning the match extends to the last closing </TAG> in the search string.
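To see the reluctant/greedy difference concretely, here is a small sketch in Python's re module. The tag name b and the sample string are assumptions chosen for illustration; they are not from the original article.

```python
import re

# Hypothetical input with two separate <b> elements on one line.
text = '<b>one</b> and <b>two</b>'

lazy = re.search(r'<b\b[^>]*>(.*?)</b>', text)    # reluctant: .*?
greedy = re.search(r'<b\b[^>]*>(.*)</b>', text)   # greedy: .*

print(lazy.group(1))    # one
print(greedy.group(1))  # one</b> and <b>two
```

The reluctant version stops at the first </b>; the greedy version runs to the last </b> and captures everything in between, including the intermediate tags.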


Be sure to check out the Stack Overflow Regular Expressions FAQ :)

Upvotes: 0

zx81

Reputation: 41838

  • That's in order to match things like <a href=...> stuff </a>, as opposed to a simple <b> stuff </b> where your option would work.
  • The \b boundary is needed (taking TAG to be a) in order to avoid matching things like <attribute ...> stuff </a>
  • The lazy quantifier .*? between the opening and closing tags is needed, as opposed to [^<]*, because between the opening and closing tags you might have another tag (for instance <b>)
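The last point can be illustrated with a nested tag. This is a sketch in Python's re module; the sample string is made up for the demonstration.

```python
import re

# Hypothetical input: an <a> element that contains a nested <b> element.
text = '<a href="x">some <b>bold</b> text</a>'

# Lazy dot: allowed to cross the inner tags on its way to </a>.
lazy = re.search(r'<a\b[^>]*>(.*?)</a>', text)

# Negated class: cannot consume '<', so it stops dead at the nested <b>.
negated = re.search(r'<a\b[^>]*>([^<]*)</a>', text)

print(lazy.group(1))  # some <b>bold</b> text
print(negated)        # None
```

With [^<]* the content can never include a <, so the pattern fails as soon as the element contains any child tag.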

Upvotes: 1

McLovin

Reputation: 3674

Some opening tags have attributes, like <img src="asdf.png">. The tag does not end until the > is reached, so \b marks the end of the tag name and [^>]* consumes the attributes, here src="asdf.png".
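As a rough check of this, the sketch below (Python's re module) applies only the opening-tag part of the pattern to the img example. The capturing group around [^>]* is an addition for illustration, so that what it consumes can be printed; the imgx tag is a made-up counterexample.

```python
import re

# Capturing group around [^>]* is added only to show what it consumes.
flexible = re.compile(r'<img\b([^>]*)>')
strict = re.compile(r'<img>')

tag = '<img src="asdf.png">'

attrs = flexible.search(tag).group(1)
print(repr(attrs))         # ' src="asdf.png"'
print(strict.search(tag))  # None: the attributes are not allowed for

# \b still rejects a different tag whose name merely starts with img.
print(flexible.search('<imgx src="asdf.png">'))  # None
```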

Upvotes: 0
