Finding links in HTML using regex

Question

I'm trying to find all links in a Wikipedia article while excluding fragments (links starting with #).

Initially I was using which worked fine (although what it captures is a bit messy; I can clean this up later in Python). But then I realized that "


My thought behind this was to capture text starting with '
Both of these are now captured, which is what I want



And this is not captured, which is also what I want

But this is captured, which I do not want

Why does this happen?

Arnold Daniels · Accepted Answer

With ., you're matching all characters, including the closing >.

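The non-greedy modifier in .*? only makes the engine prefer the shortest match. It stops before the > when the rest of the pattern matches there, but when it doesn't, the engine keeps extending .*? past the > (and into the following tags) until it finds a match.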

The same goes for \S, which matches all non-space characters including a closing ".
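To make this concrete, here is a small sketch in Python. The HTML snippets and the lazy pattern are made up for illustration (the pattern from the question may have looked different); the point is only that .*? and \S will happily cross a > or a " when that is what it takes to match.

```python
import re

# Made-up snippet: a fragment link followed by an unrelated tag with an href.
html = '<a href="#cite_note-1">[1]</a> <link href="style.css">'

# Hypothetical lazy pattern; the one in the question may have differed.
lazy = re.compile(r'<a.*?href="([^#].*?)"')

m = lazy.search(html)
# The first href starts with "#", so [^#] fails there and the lazy .*?
# keeps growing, past the ">" of the <a> tag, until the next href=".
print(m.group(0))  # runs from the first "<a" into the <link> tag
print(m.group(1))  # style.css

# \S is just as permissive: it accepts the closing quote as well.
s = re.search(r'href="(\S*)"', '<a href="/wiki/A"title="A">')
print(s.group(1))  # swallows the quote and the title attribute too
```

So the fragment link is not skipped at all; the match simply starts at it and ends in a later tag.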

You should explicitly exclude all characters that shouldn't match, and not rely on non-greedy.

<a\s[^>]*\bhref="([^#"][^"]*)"
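A minimal sketch of how this could be used from Python. The leading <a\s[^> part of the pattern is assumed from the explanation of the pattern's parts, and the sample HTML and the name LINK_RE are invented for the example.

```python
import re

# Assumed full pattern: "<a", whitespace, attributes that never cross ">",
# then an href whose value does not start with "#".
LINK_RE = re.compile(r'<a\s[^>]*\bhref="([^#"][^"]*)"')

# Made-up sample: two ordinary links and one fragment link.
sample = (
    '<a href="/wiki/Regex">Regex</a> '
    '<a class="mw-redirect" href="/wiki/HTML" title="HTML">HTML</a> '
    '<a href="#cite_note-1">[1]</a>'
)

links = LINK_RE.findall(sample)
print(links)  # ['/wiki/Regex', '/wiki/HTML']
```

Because [^>]* cannot leave the current tag and [^"]* cannot leave the attribute value, a failed match on the fragment link has nowhere to spill over to.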

Explanation

<a matches the characters <a literally (case sensitive)
\s matches any whitespace character (equal to [\r\n\t\f\v ])
[^>]* matches a single character not present in the list below
    * Quantifier: matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
    > matches the character > literally (case sensitive)
\b asserts position at a word boundary: (^\w|\w$|\W\w|\w\W)
href=" matches the characters href=" literally (case sensitive)
1st Capturing Group ([^#"][^"]*)
    [^#"] matches a single character not present in the list below
        #" matches a single character in the list #" (case sensitive)
    [^"]* matches a single character not present in the list below
        * Quantifier: matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
        " matches the character " literally (case sensitive)
" matches the character " literally (case sensitive)


Try it @ regex101
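If you end up using this from Python, the same breakdown can live next to the pattern via re.VERBOSE. This is only a sketch: the leading <a\s[^> is assumed from the explanation, and the name LINK_RE is made up.

```python
import re

# In verbose mode, whitespace and "#" comments outside character classes are
# ignored, so the "#" inside [^#"] is still taken literally.
LINK_RE = re.compile(
    r'''
    <a\s            # literal "<a" followed by one whitespace character
    [^>]*           # any other attributes, never past the closing ">"
    \bhref="        # the href attribute, starting at a word boundary
    (               # capture group 1: the link target
        [^#"]       #   first character: neither "#" nor a quote
        [^"]*       #   the rest, up to the closing quote
    )
    "               # closing quote of the attribute value
    ''',
    re.VERBOSE,
)

found = LINK_RE.findall('<a href="/wiki/A">A</a> <a href="#frag">f</a>')
print(found)  # ['/wiki/A']
```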
This won't properly match all cases in HTML. As the OP stated, this is just an exercise in regular expressions.
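For comparison, here is a rough sketch of the non-regex route using html.parser from Python's standard library. The class name LinkCollector and the sample markup are invented for the example. It copes with things the pattern above does not handle, such as single-quoted attribute values or uppercase tag names.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values of <a> tags, skipping fragment-only links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and not value.startswith("#"):
                self.links.append(value)


collector = LinkCollector()
collector.feed(
    "<a href='/wiki/Regex'>Regex</a> "
    '<A HREF="#top">top</A> '
    '<a class="x" href="/wiki/HTML">HTML</a>'
)
print(collector.links)  # ['/wiki/Regex', '/wiki/HTML']
```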
