Reputation: 15372
This regular expression should match an HTML start tag, I think.
var results = html.match(/<(\/?)(\w+)([^>]*?)>/);
I see it should first match the <, but then I am confused about what the capture (\/?) accomplishes. Am I correct in reasoning that ([^>]*?)> searches for every character except > zero or more times? If so, why is the (\w+) capture necessary? Doesn't [^>]*? already cover what it matches?
Upvotes: 7
Views: 6930
Reputation: 46647
Take it token by token:
/ : begin regex literal
< : match a literal <
(\/?) : match 0 or 1 (?) literal /, which is escaped by the \
(\w+) : match one or more "word characters"
([^>]*?) : lazily* match zero or more (*?) of anything that is not a >
> : match a literal >
/ : end regex literal
* lazily: adding "?" after a repetition quantifier makes it lazy, meaning the regex will match the preceding token as few times as it can while still letting the overall match succeed. See the documentation.
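To make the lazy modifier concrete, here is a small sketch on a made-up sample string (the string and variable names are my own, not from the question). It also suggests that, in this particular regex, the ? after * should not change the result, because [^>] can never consume a > in the first place.

```javascript
// Hypothetical sample input, chosen only for illustration.
const sample = '<b>bold</b>';

// Greedy: .* takes as much as it can, so the match runs to the LAST ">".
const greedy = sample.match(/<(.*)>/);
// Lazy: .*? takes as little as it can, so the match stops at the FIRST ">".
const lazy = sample.match(/<(.*?)>/);

console.log(greedy[1]); // b>bold</b
console.log(lazy[1]);   // b

// With the negated class [^>], lazy and greedy give the same match here,
// because the class itself already stops at the first ">".
const tagLazy = sample.match(/<(\/?)(\w+)([^>]*?)>/);
const tagGreedy = sample.match(/<(\/?)(\w+)([^>]*)>/);

console.log(tagLazy[0]);   // <b>
console.log(tagGreedy[0]); // <b>
```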
So essentially this regular expression will match "<", optionally followed by a "/", followed by one or more letters, digits, or underscores, followed by zero or more characters that are not ">", and finally followed by a ">".
That being said, the token (\w+) is not redundant, as it ensures there is at least one word character right after the < (or </) and before the >.
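A quick way to convince yourself is to compare the regex against a variant with the (\w+) group removed. The variant, the test strings, and the variable names below are my own and only for comparison; this is a sketch, not a recommendation.

```javascript
// The regex from the question.
const original = /<(\/?)(\w+)([^>]*?)>/;
// Hypothetical variant without the (\w+) group, for comparison only.
const withoutName = /<(\/?)([^>]*?)>/;

// An empty pair of angle brackets is rejected only by the original.
console.log(original.test('<>'));    // false
console.log(withoutName.test('<>')); // true

// A "tag" that does not start with a word character is rejected as well.
console.log(original.test('< div>')); // false

// Without (\w+), the tag name and the attributes land in the same group.
const merged = '<div id="x">'.match(withoutName);
console.log(merged[2]); // div id="x"

// With (\w+), the tag name gets a group of its own.
const split = '<div id="x">'.match(original);
console.log(split[2]); // div
```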
Please be aware that attempting to parse HTML with regular expressions is generally a bad idea.
Upvotes: 4
Reputation: 13574
Using the power of Debuggex to generate an image for you :)
<(\/?)(\w+)([^>]*?)>
will be evaluated like this:
[Debuggex diagram of the regex]
As you can see, it matches HTML tags (opening and closing tags). The regex contains three capture groups, capturing the following:
(\/?) : existence of / (it's a closing tag, if present)
(\w+) : name of the tag
([^>]*?) : everything else until the tag closes (e.g. attributes)
This way it matches <a href="#">. Interestingly, it does not match <a data-fun="fun>nofun"> correctly because it stops at the > within the data-fun attribute, although (I think) > is valid in an attribute value.
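Here is a minimal sketch of both cases, run against the two tags mentioned above (the variable names are mine):

```javascript
const tagPattern = /<(\/?)(\w+)([^>]*?)>/;

// The simple case: the whole tag is matched.
const simple = '<a href="#">'.match(tagPattern);
console.log(simple[0]); // <a href="#">

// The tricky case: the match ends at the ">" inside the attribute value.
const tricky = '<a data-fun="fun>nofun">'.match(tagPattern);
console.log(tricky[0]); // <a data-fun="fun>
console.log(tricky[2]); // a
console.log(tricky[3]); //  data-fun="fun   (starts with a space)
```

The regex has no notion of quoting, so it cannot tell a > inside an attribute value from the one that closes the tag.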
Another funny thing is that the tag-name capture does not capture all theoretically valid XHTML tags. XHTML allows Letter | Digit | '.' | '-' | '_' | ':' | .. (source: XHTML spec). (\w+), however, does not match ., -, and :. An imaginary <.foobar> tag will not be matched by this regex. This should not have any real-life impact, though.
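A small sketch of what that looks like in practice. Apart from <.foobar>, the tag names are invented for illustration. Note that for names containing - or :, the tag as a whole still matches, but the name group is cut short and the remainder ends up in the third group.

```javascript
const tagPattern = /<(\/?)(\w+)([^>]*?)>/;

// No word character directly after "<", so there is no match at all.
console.log(tagPattern.exec('<.foobar>')); // null

// A namespaced name: (\w+) stops at the colon.
const namespaced = tagPattern.exec('<svg:rect width="1">');
console.log(namespaced[2]); // svg
console.log(namespaced[3]); // :rect width="1"

// A hyphenated name: (\w+) stops at the dash.
const hyphenated = tagPattern.exec('<my-widget>');
console.log(hyphenated[2]); // my
console.log(hyphenated[3]); // -widget
```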
You see that parsing HTML using regexes is a risky thing. You might be better off with an HTML parser.
Upvotes: 4
Reputation: 27012
To answer your last question, (\w+) and ([^>]*?) are not redundant. They both serve important functions in the expression.
This expression finds start or end tags.
(\/?) matches a /, but the ? makes it optional.
(\w+) matches word characters, intended to match the tag name here.
([^>]*?) is intended to match attributes.
So if you had the string <div class="text">, the (\w+) in the expression would match div and the ([^>]*?) would match class="text" (including the space in front of it).
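You can check this by printing the capture groups for that string. This is a minimal sketch using the same html.match call as in the question, with html set to the example tag:

```javascript
const html = '<div class="text">';
const results = html.match(/<(\/?)(\w+)([^>]*?)>/);

console.log(results[0]); // <div class="text">   (the whole match)
console.log(JSON.stringify(results[1])); // ""    (no slash: a start tag)
console.log(results[2]); // div                   (the tag name)
console.log(JSON.stringify(results[3])); // " class=\"text\""  (note the leading space)

// Trimming the third group gives the bare attribute text.
console.log(results[3].trim()); // class="text"
```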
Upvotes: 2
Reputation: 28305
Demo (in ruby, not javascript, but it makes no difference): http://www.rubular.com/r/bhw2O28qUr
To summarise, the (\/?) group is there so that end tags are matched too, with their slash captured.
Upvotes: 0
Reputation: 71538
(\/?) matches and captures the slash of any closing tag, such as </i> or </strong>, if you're familiar with them.
Another thing to note is that \w is really the character class [a-zA-Z_\d], so other characters like =, ", etc. are not matched by it; they are, however, matched by [^>]. And yes, you are correct about that bit.
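Both points can be checked with a short sketch. The helper name describeTag and the sample strings are my own, purely for illustration:

```javascript
const tagPattern = /<(\/?)(\w+)([^>]*?)>/;

// Hypothetical helper: reports whether a tag is opening or closing, and its name.
function describeTag(tag) {
  const [, slash, name] = tag.match(tagPattern);
  return (slash === '/' ? 'closing ' : 'opening ') + name;
}

console.log(describeTag('<strong>'));  // opening strong
console.log(describeTag('</strong>')); // closing strong
console.log(describeTag('</i>'));      // closing i

// \w and [a-zA-Z_\d] accept and reject the same sample strings.
const shorthand = /^\w+$/;
const explicit = /^[a-zA-Z_\d]+$/;
const samples = ['div', 'h1', 'my_tag', 'a=b', '"q"'];
const agree = samples.every((s) => shorthand.test(s) === explicit.test(s));
console.log(agree); // true

// "=" and '"' are rejected by \w but accepted by [^>].
console.log(/^\w+$/.test('="'));    // false
console.log(/^[^>]+$/.test('="'));  // true
```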
Upvotes: 3