shorshe

Reputation: 13

Find URLs in HTML source code that are not yet tagged; ignore tagged URLs

I'd like to find URLs in HTML source code, but only URLs that are not already wrapped in a link tag. I came up with this:

(?<!")((http(s)?://|http(s)?://www\.|(?<!/)www\.)([\w\._\-/&%]+))(?!</a>)

It does a good job of skipping URLs that are part of a link's href attribute, but it still finds URLs that are already tagged, i.e. the link text between <a ...> and </a>. I thought that testing for "not followed by a closing a-tag" would exclude those. Where is the mistake?

<a href="https://foo.com">https://www.foo.com</a> <- should not hit
<span class="bar>www.test.de</span> <- HIT
"http://www.test.de" <- HIT
<a href="http://test.de">http://www.foo.com/_manno/Propello&%_-/ramblay</a> <- should not hit
<span>http://www.test.de/alala </span> <- HIT

My RegEx on Debuggex

Upvotes: 1

Views: 70

Answers (1)

Vlad

Reputation: 1197

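The (?!</a>) at the end of your regex only checks the position immediately after the match. When it fails, the engine backtracks by one character and matches https://www.foo.co, which is followed by m</a> rather than </a>, so the lookahead passes. To make your sample work, replace the lookahead at the end of your regex with: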

(?![^<]*<\/a>)
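A quick way to compare the two lookaheads is to run both against the samples from the question. The following is a minimal sketch in Python, assuming the `re` module handles the pattern the same way as the flavor used in the question; the names URL, ORIGINAL, FIXED and hits are only illustrative and do not come from the question.

```python
import re

# URL-matching part of the question's regex, copied unchanged.
URL = r'(?<!")((http(s)?://|http(s)?://www\.|(?<!/)www\.)([\w\._\-/&%]+))'

ORIGINAL = re.compile(URL + r'(?!</a>)')      # lookahead from the question
FIXED = re.compile(URL + r'(?![^<]*<\/a>)')   # lookahead suggested above


def hits(pattern, text):
    """Return the full URL (group 1) of every match in text."""
    return [m.group(1) for m in pattern.finditer(text)]


linked = '<a href="https://foo.com">https://www.foo.com</a>'
plain = '<span>http://www.test.de/alala </span>'

print(hits(ORIGINAL, linked))  # ['https://www.foo.co'] (truncated match)
print(hits(FIXED, linked))     # []
print(hits(FIXED, plain))      # ['http://www.test.de/alala']
```

With [^<]* inside the lookahead, giving back characters from the URL no longer helps: whatever is given back is absorbed by [^<]*, and the </a> is still found.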

P.S.

If I had a similar goal, I would want the following cases to behave like this:

<span class="bar>"http://www.my.test"</span> <- I'd want this to HIT ;)
"http://www.test.de" <- I'd want this to HIT too (while not inside a tag)
<a href="http://www.test.de" option="2"> <- should NOT hit

If your goal matches what I just described, remove the leading lookbehind (?<!") completely and replace the lookahead at the end with:

(?![^<>]*(>|<\/a>))

This means the URL must not be followed, before the next angle bracket, by either ">" (the closing bracket of a tag the URL sits inside) or "</a>".
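The three cases from the P.S. can be checked the same way. This is again only a sketch in Python: it assumes that "the lookbehind" refers to the leading (?<!") and that the inner (?<!/) before www\. stays in place. PATTERN and find_untagged are hypothetical names used for illustration.

```python
import re

# Assumption: only the leading (?<!") is dropped; the inner (?<!/) is kept.
PATTERN = re.compile(
    r'((http(s)?://|http(s)?://www\.|(?<!/)www\.)([\w\._\-/&%]+))'
    r'(?![^<>]*(>|<\/a>))'
)


def find_untagged(text):
    """Return URLs that are neither inside a tag nor directly before </a>."""
    return [m.group(1) for m in PATTERN.finditer(text)]


print(find_untagged('<span class="bar>"http://www.my.test"</span>'))
# ['http://www.my.test']
print(find_untagged('"http://www.test.de"'))
# ['http://www.test.de']
print(find_untagged('<a href="http://www.test.de" option="2">'))
# []
```

The href value is rejected because only non-bracket characters separate it from the tag's closing ">", while the quoted URL in plain text has no ">" after it before the next "<".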

Upvotes: 1
