Reputation: 46841
I was reading the regular-expressions.info examples to learn more regex patterns.
The first example, Grabbing HTML Tags, discusses a regex for matching the opening and closing pair of a specific HTML tag.
<TAG\b[^>]*>(.*?)</TAG>
I'm a little confused here. Why is \b[^>]*
added to the above regex pattern, when the same thing can apparently be achieved with the pattern below?
<TAG>(.*?)</TAG>
Why is this extra part of the pattern used? Does it help performance?
Upvotes: 0
Views: 691
Reputation: 174696
Because without the word boundary, it matches any tag whose name merely starts with TAG, not only TAG itself.
You could try the demo. Just test the pattern with and without \b.
<TAG\b[^>]*>(.*?)</TAG>
Explanation:
< matches the < symbol.
TAG is the tag name.
\b matches between a word character and a non-word character.
[^>]* matches any character other than > zero or more times.
(.*?) captures the section between the opening and closing tags. The ? after the * makes the match reluctant.
</TAG> matches the end tag.
For example:
Input:
<a href="www.foo.com">link</a>
<ahref="www.foo.com">link</a>
Regex:
<a[^>]*>(.*?)<\/a>
The above regex would match both the links.
Regex:
<a\b[^>]*>(.*?)<\/a>
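But this would match only the first one, because a word boundary exists between a and the space character that follows it. In the second input, a is followed by h, another word character, so there is no boundary and the match fails.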
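The difference can be checked with a short script. This is only a sketch using Python's re module; the variable names are made up for illustration, and the input is the two lines from the example above.

```python
import re

# The two sample inputs from the example above
html = '<a href="www.foo.com">link</a>\n<ahref="www.foo.com">link</a>'

# Without \b, "<a" also matches the start of "<ahref"
without_boundary = re.compile(r'<a[^>]*>(.*?)</a>')
# With \b, the tag name must end right after "a"
with_boundary = re.compile(r'<a\b[^>]*>(.*?)</a>')

matches_without = [m.group(0) for m in without_boundary.finditer(html)]
matches_with = [m.group(0) for m in with_boundary.finditer(html)]

print(len(matches_without))  # 2
print(matches_with)          # ['<a href="www.foo.com">link</a>']
```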
Upvotes: 0
Reputation: 20163
The \b[^>]* in
<TAG\b[^>]*>(.*?)</TAG>
allows text (such as attributes: width="30") and whitespace in the opening tag, as long as the tag is exactly TAG and not TAGX or some other tag name; that's what the \b word boundary is for. Syntax and spacing in HTML are very loosey goosey. It's always safer to allow extra attributes and whitespace, as a single HTML tag can span many lines.
The latter regex
<TAG>(.*?)</TAG>
only allows the opening tag to be exactly <TAG>, followed by some text (which can span multiple lines only if . is set to match newlines), then </TAG>.
The ? in .*? makes the quantifier reluctant, meaning the match ends at the next closing </TAG>. Removing the ? makes it greedy, meaning the match extends to the last closing </TAG> in the search string.
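To see reluctant versus greedy matching side by side, here is a minimal sketch in Python. The sample string and the use of <b> as the tag are assumptions chosen for illustration.

```python
import re

# Hypothetical sample with two separate <b>...</b> pairs on one line
text = '<b>one</b> and <b>two</b>'

# Reluctant: each match stops at the next closing tag
reluctant = re.findall(r'<b\b[^>]*>(.*?)</b>', text)
# Greedy: the single match runs to the last closing tag
greedy = re.findall(r'<b\b[^>]*>(.*)</b>', text)

print(reluctant)  # ['one', 'two']
print(greedy)     # ['one</b> and <b>two']
```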
Be sure to check out the Stack Overflow Regular Expressions FAQ :)
Upvotes: 0
Reputation: 41838
[^>]* is needed to match opening tags that carry attributes, such as <a href=...> stuff </a>, as opposed to a simple <b> stuff </b>, where your option would work.
The \b boundary is needed in order to avoid matching things like <attribute ...> stuff </a>.
.*? between the opening and closing tags is needed, as opposed to [^<]*, because between the opening and closing tags you might have another tag (for instance <b>).
Upvotes: 1
Reputation: 3674
Some opening tags have attributes, like <img src="asdf.png">. The tag does not end until the > is reached, so the word boundary marks the end of the tag name and the non-> characters ([^>]*) match the src="asdf.png" part.
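As a rough check in Python: the exact pattern <img> fails on a tag with attributes, while the \b[^>]* version matches it. The capture group around [^>]* is added here only to show what that part consumes; it is not in the original pattern.

```python
import re

tag = '<img src="asdf.png">'

# Exact opening tag: no room for attributes, so no match
exact = re.search(r'<img>', tag)
# \b ends the tag name; [^>]* (captured for display) takes the rest up to >
flexible = re.search(r'<img\b([^>]*)>', tag)

print(exact)              # None
print(flexible.group(1))  # ' src="asdf.png"' (including the leading space)
```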
Upvotes: 0