Jonas

Reputation: 19

Finding links in HTML using regex

I'm trying to find all links in a Wikipedia article while excluding fragments (links starting with #).

Initially I was using <a href=\"[^#]\S*?\", which worked fine (what it captures is a bit messy, but I can clean that up later in Python). But then I realized that "<a " isn't necessarily followed directly by "href", so I changed the expression to:

<a .*?href=\"[^#]\S*?\"

My thinking was: capture text starting with '<a ', followed by any characters zero or more times until you reach 'href="', then one character that is not '#', followed by zero or more non-whitespace characters until a quote (") is reached.

Both of these are now captured, which is what I want:

<a title="test" href="link"

<a href="link"

And this is not captured, which is also what I want:

<a class="class1" href="#fragment">

But this is captured, which I do not want:

<a href="#citewnotew1"></a></sup></div></td></tr><tr><th scope="row" style="line-height:1.2em; padding-right:0.65em;"><a href="/wiki/Filename_extension"

Why does this happen?

Upvotes: 0

Views: 203

Answers (2)

Inspiraller

Reputation: 3806

Try this instead. Rather than matching any character, it only matches characters that are not the end of the tag or the beginning of a new tag:

<a [^\<\>]*href\=\"[^\#][^\"]*?\"
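As a quick check, here is a minimal Python sketch that runs the pattern above against the four snippets from the question. The variable names are only illustrative, and the pattern is copied as written (the extra backslashes before <, >, =, " and # are redundant but harmless in Python's re module).

```python
import re

# Pattern from this answer, copied verbatim.
pattern = re.compile(r'<a [^\<\>]*href\=\"[^\#][^\"]*?\"')

# The four snippets from the question, in the same order.
samples = [
    '<a title="test" href="link"',
    '<a href="link"',
    '<a class="class1" href="#fragment">',
    '<a href="#citewnotew1"></a></sup></div></td></tr><tr><th scope="row" '
    'style="line-height:1.2em; padding-right:0.65em;"><a href="/wiki/Filename_extension"',
]

# One list of matches per snippet.
results = [pattern.findall(sample) for sample in samples]
for found in results:
    print(found)
```

For the last snippet, the match now starts at the second <a rather than at the fragment link, because [^\<\>]* cannot cross the > that closes the first tag.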

Upvotes: 0

Arnold Daniels

Reputation: 16563

With ., you're matching all characters, including the closing >.

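The non-greedy modifier in .*? means the engine tries the shortest match first, so it stops before the > whenever a match is possible there. If no match is possible at that point, it keeps expanding past the > and into the following tags until it finds one.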

The same goes for \S, which matches all non-space characters including a closing ".
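To see the lazy quantifier expand across tags, here is a small Python sketch. It uses the asker's second pattern unchanged; the input string is a shortened, made-up stand-in for the Wikipedia snippet in the question.

```python
import re

# The asker's second pattern, unchanged.
lazy = re.compile(r'<a .*?href="[^#]\S*?"')

# Made-up input: a fragment link followed by a normal link.
html = '<a href="#frag"></a><b><a href="/wiki/Page"'

# The match starts at the first "<a ". Since [^#] fails on "#frag",
# .*? keeps growing until it reaches the second href.
match = lazy.search(html)
print(match.group(0))
```

The match covers the whole input, from the first <a to the closing quote of /wiki/Page, which is the same effect the question describes.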

You should explicitly exclude all characters that shouldn't match, and not rely on non-greedy.

<a\s[^>]*\bhref="([^#"][^"]*)"

Explanation

  • <a matches the characters <a literally (case sensitive)
  • \s matches any whitespace character (equal to [\r\n\t\f\v ])
  • [^>]* matches a single character not present in the list below
    • * Quantifier: matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
    • > matches the character > literally (case sensitive)
  • \b asserts position at a word boundary: (^\w|\w$|\W\w|\w\W)
  • href=" matches the characters href=" literally (case sensitive)
  • 1st Capturing Group ([^#"][^"]*)
    • [^#"] matches a single character not present in the list below
      • #" matches a single character in the list #" (case sensitive)
    • [^"]* matches a single character not present in the list below
      • * Quantifier: matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
      • " matches the character " literally (case sensitive)
  • " matches the character " literally (case sensitive)
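Since the pattern has a capturing group, re.findall in Python returns just the link targets, which avoids the cleanup step mentioned in the question. A minimal sketch, with illustrative variable names, run against the question's four snippets one at a time:

```python
import re

# Pattern from this answer; group 1 is the href value.
link_re = re.compile(r'<a\s[^>]*\bhref="([^#"][^"]*)"')

# The four snippets from the question, in the same order.
samples = [
    '<a title="test" href="link"',
    '<a href="link"',
    '<a class="class1" href="#fragment">',
    '<a href="#citewnotew1"></a></sup></div></td></tr><tr><th scope="row" '
    'style="line-height:1.2em; padding-right:0.65em;"><a href="/wiki/Filename_extension"',
]

# With exactly one group, findall returns a list of the captured strings.
hrefs = [link_re.findall(sample) for sample in samples]
for found in hrefs:
    print(found)
```

The snippets are searched separately on purpose: the first two have no closing >, so in one joined string [^>]* could run on into the next snippet.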

Try it @ regex101

This won't properly match all cases in HTML. As the OP stated, this is just an exercise in regular expressions.
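For cases the regex misses (single-quoted attributes, uppercase tag names, and so on), an HTML parser is the usual alternative. Below is a minimal sketch using html.parser from Python's standard library; the LinkCollector class and the sample markup are made up for illustration and are not part of the question.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values of <a> tags, skipping fragment-only links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and not value.startswith("#"):
                self.links.append(value)


# Made-up markup with a fragment link, an uppercase single-quoted link,
# and a plain link.
html = (
    '<a class="class1" href="#fragment">x</a>'
    "<A HREF='/wiki/Regex'>y</A>"
    '<a href="/wiki/Filename_extension">z</a>'
)

collector = LinkCollector()
collector.feed(html)
collector.close()
print(collector.links)
```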

Upvotes: 1
