Reputation: 19
I'm trying to find all links in a Wikipedia article while excluding fragments (links starting with #).
Initially I was using <a href=\"[^#]\S*?\"
which worked fine (although what it captures is a bit messy, I can clean this up later in Python). But then I realized that "<a " isn't necessarily directly followed by "href", so I changed the expression to
<a .*?href=\"[^#]\S*?\"
My thinking was: capture text starting with '<a ', followed by any characters zero to unlimited times until 'href="' is reached, then one character that is not '#', followed by zero to unlimited non-whitespace characters until a quote (") is reached.
Both of these are now captured, which is what I want:
<a title="test" href="link"
<a href="link"
And this is not captured, which is also what I want:
<a class="class1" href="#fragment">
But this is captured, which I do not want:
<a href="#citewnotew1"></a></sup></div></td></tr><tr><th scope="row" style="line-height:1.2em; padding-right:0.65em;"><a href="/wiki/Filename_extension"
Why does this happen?
Upvotes: 0
Views: 203
Reputation: 3806
Try this. Instead of matching any character, it only matches characters that are neither the end of the current tag (>) nor the beginning of a new tag (<):
<a [^\<\>]*href\=\"[^\#][^\"]*?\"
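A quick way to check this pattern against the four samples from the question is sketched below. The pattern is written without the redundant backslash escapes, which should be an equivalent spelling, and the last sample is a shortened version of the long string from the question (an assumption made for readability).

```python
import re

# Same pattern as above, minus the unnecessary escapes.
pattern = re.compile(r'<a [^<>]*href="[^#][^"]*?"')

samples = [
    '<a title="test" href="link"',
    '<a href="link"',
    '<a class="class1" href="#fragment">',
    # Shortened stand-in for the long sample in the question.
    '<a href="#citewnotew1"></a></sup><th scope="row"><a href="/wiki/Filename_extension"',
]

# For each sample, keep the matched text, or None if nothing matched.
matches = [m.group(0) if (m := pattern.search(s)) else None for s in samples]

for sample, match in zip(samples, matches):
    print(repr(match))
```

For the last sample, the match now starts at the second <a rather than the first, because [^<>]* cannot cross the > that closes the fragment link.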
Upvotes: 0
Reputation: 16563
With . you're matching all characters, including the closing >. The non-greedy modifier in .*? means it stops before the > if the rest of the pattern matches there; if it doesn't, .*? keeps expanding past the > to try to find a match. The same goes for \S, which matches all non-space characters, including a closing quote (").
You should explicitly exclude all characters that shouldn't match, and not rely on non-greedy.
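To see the difference concretely, here is a minimal sketch comparing the lazy pattern from the question with one that uses a negated character class. The html string is a made-up, shortened stand-in for the Wikipedia markup.

```python
import re

# Hypothetical test string: a fragment link followed by a normal link.
html = '<a href="#note"></a><b>text</b><a href="/wiki/Page"'

# Pattern from the question: .*? is lazy, but it is still allowed to
# grow past > when the rest of the pattern cannot match earlier.
lazy = re.search(r'<a .*?href="[^#]\S*?"', html)

# Negated classes: [^>]* can never leave the tag it started in.
bounded = re.search(r'<a [^>]*href="[^#"][^"]*"', html)

print(lazy.group(0))
print(bounded.group(0))
```

The lazy version starts at the first <a, fails on the # there, and then stretches .*? all the way to the second href. The bounded version gives up on the first tag and matches only the second one.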
<a\s[^>]*\bhref="([^#"][^"]*)"
Explanation
<a matches the characters <a literally (case sensitive)
\s matches any whitespace character (equal to [\r\n\t\f\v ])
[^>]* matches any single character other than >
    * Quantifier: matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\b asserts position at a word boundary: (^\w|\w$|\W\w|\w\W)
href=" matches the characters href=" literally (case sensitive)
([^#"][^"]*) is a capturing group
    [^#"] matches a single character that is not in the list #" (case sensitive)
    [^"]* matches any single character other than "
        * Quantifier: matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
" matches the character " literally (case sensitive)
This won't properly match all cases in HTML. As the OP stated, this is just an exercise in regular expressions.
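Since the pattern has a capturing group around the URL, it can be used from Python roughly as sketched below. The html snippet and the name LINK_RE are made up for illustration.

```python
import re

# The pattern from this answer; group 1 is the href value.
LINK_RE = re.compile(r'<a\s[^>]*\bhref="([^#"][^"]*)"')

# Hypothetical snippet with two normal links and one fragment link.
html = (
    '<p><a title="test" href="/wiki/Regex">regex</a> '
    '<a class="ref" href="#cite_note-1">[1]</a> '
    '<a href="https://example.org/page">ext</a></p>'
)

# With exactly one group in the pattern, findall returns the group contents.
links = LINK_RE.findall(html)
print(links)
```

Because findall returns only the captured group, the result is already the bare URL, so no cleanup of the surrounding <a ... href=" text is needed.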
Upvotes: 1