Reputation: 13
I'd like to find URLs in HTML source code, but only URLs that don't have tags around them. I came up with this:
(?<!")((http(s)?://|http(s)?://www\.|(?<!/)www\.)([\w\._\-/&%]+))(?!</a>)
It does a good job of avoiding URLs that are part of links, but it also finds tagged URLs. I thought that by testing for "not followed by a closing a-tag" I could avoid tagged URLs. Where is the mistake?
<a href="https://foo.com">https://www.foo.com</a> <- should not hit
<span class="bar>www.test.de</span> <-HIT
"http://www.test.de" <- HIT
<a href="http://test.de">http://www.foo.com/_manno/Propello&%_-/ramblay</a> should not HIT
<span>http://www.test.de/alala </span> <-HIT
Upvotes: 1
Views: 70
Reputation: 1197
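The mistake is that your lookahead only checks the text immediately after the match. The engine simply backtracks by one character and matches https://www.foo.co, which is followed by m</a> rather than </a>. To make your sample work, just replace the lookahead (at the end of your regexp) with: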
(?![^<]*<\/a>)
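As a rough check of the difference, here is a minimal Python sketch (Python's `re` module is an assumption; the question does not name a regex flavor). The names `ORIGINAL`, `FIXED` and `first_hit` are made up for illustration; the patterns and sample strings are taken from the question.

```python
import re

# Pattern from the question: the lookahead only inspects the position right after the match.
ORIGINAL = r'(?<!")((http(s)?://|http(s)?://www\.|(?<!/)www\.)([\w\._\-/&%]+))(?!</a>)'
# Same pattern with the lookahead swapped for the one suggested above.
FIXED = r'(?<!")((http(s)?://|http(s)?://www\.|(?<!/)www\.)([\w\._\-/&%]+))(?![^<]*<\/a>)'


def first_hit(pattern, text):
    """Return the first matched URL in text, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


linked = '<a href="https://foo.com">https://www.foo.com</a>'
plain = '<span>http://www.test.de/alala </span>'

print(first_hit(ORIGINAL, linked))  # https://www.foo.co (backtracked by one character)
print(first_hit(FIXED, linked))     # None
print(first_hit(FIXED, plain))      # http://www.test.de/alala
```

With the new lookahead, giving back characters no longer helps the engine, because every shorter match is still followed by non-"<" characters and then "</a>".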
P.S.
If I had a similar goal, I'd want the following constructions to hit:
<span class="bar>"http://www.my.test"</span> <- I'd want this to HIT ;)
"http://www.test.de" <- I'd want this to HIT too (while not inside a tag)
<a href="http://www.test.de" option="2"> <- should NOT hit
If your goal matches what I just described, then remove the lookbehind completely and replace the respective lookahead with:
(?![^<>]*(>|<\/a>))
This basically means that the next angle bracket after the URL is neither a ">" (the closing bracket of a tag) nor the start of "</a>".
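A small Python sketch of this second variant, again assuming Python's `re` module. I read "the lookbehind" as the leading `(?<!")` and keep the inner `(?<!/)` in front of `www\.`; that reading, and the names `RELAXED` and `hits`, are assumptions for illustration.

```python
import re

# Question's pattern without the leading (?<!") and with the broader lookahead.
RELAXED = re.compile(
    r'((http(s)?://|http(s)?://www\.|(?<!/)www\.)([\w\._\-/&%]+))'
    r'(?![^<>]*(>|<\/a>))'
)


def hits(text):
    """Return every URL the pattern matches in text."""
    return [match.group(0) for match in RELAXED.finditer(text)]


samples = [
    '<span class="bar>"http://www.my.test"</span>',       # quoted, outside a tag
    '"http://www.test.de"',                                # quoted, no tags at all
    '<a href="http://www.test.de" option="2">',            # attribute value inside a tag
    '<a href="https://foo.com">https://www.foo.com</a>',   # link text and href
]

for sample in samples:
    print(sample, '->', hits(sample))
```

The first two samples should each yield their URL, and the last two should yield nothing. Note that plain text containing a literal ">" after a URL would also suppress the match, so this is a heuristic rather than a real HTML parser.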
Upvotes: 1