Reputation: 193
I want to extract URLs from this text:
<body>
<a href="http://domaine.com/t/text/text"> <img src="http://domaine.com/i/text/text"></a> <br>
<a href="http://domaine.com/text"></a> <br>
<a href="http://domaine.com"></a> <br>
<a href="http://domaine.com/text/text"></a> <br>
<a href="http://[GoTo]"></a> <br>
<a href="http://[NextURL]"></a> <br>
</body>
But I want to exclude URLs that match certain patterns from being extracted. Those patterns are:
http://***/i/***/***
http://***/t/***/***
http://[GoTo]
http://[NextURL]
That means I should get only these URLs as a result:
http://domaine.com/text
http://domaine.com
http://domaine.com/text/text
What I have done so far is use this regex:
$regex = '/https?\:\/\/[^\" ]+/i';
preg_match_all($regex, $string, $matches);
print_r($matches[0]);
But as you can see, this extracts all the URLs, and I don't know how to exclude some of them using my specific patterns.
Upvotes: 0
Views: 150
Reputation: 521
What you are looking for is a negative lookahead:
$regex = '/https?:\/\/(?!\[GoTo\]|\[NextURL\]|[^\" ]*\/i\/[^\" ]+|[^\" ]*\/t\/[^\" ]*)[^\" ]+/i';
The (?! ... ) group placed right after the scheme is a negative lookahead: the match fails at that position if the rest of the URL matches any of the enclosed alternatives. This might need tweaking for specific corner cases, but for the problem as stated it should give you what you need.
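One corner case worth knowing about: the /i/ and /t/ alternatives above reject a URL that contains such a segment anywhere in its path. If each *** in the question's patterns is meant to stand for a single path segment (that is an assumption on my part), a stricter variant could look like the sketch below. The function name extractUrls and the ~ delimiter are my own choices for illustration, not part of the question.

```php
<?php
// Sample input copied from the question.
$string = <<<'HTML'
<body>
<a href="http://domaine.com/t/text/text"> <img src="http://domaine.com/i/text/text"></a> <br>
<a href="http://domaine.com/text"></a> <br>
<a href="http://domaine.com"></a> <br>
<a href="http://domaine.com/text/text"></a> <br>
<a href="http://[GoTo]"></a> <br>
<a href="http://[NextURL]"></a> <br>
</body>
HTML;

// Hypothetical helper: returns every URL except the excluded ones.
function extractUrls(string $html): array
{
    // The negative lookahead right after the scheme rejects:
    //   - the [GoTo] and [NextURL] placeholders
    //   - host/i/x/y... and host/t/x/y... (assumed meaning of the *** patterns)
    // Using ~ as the delimiter avoids escaping every slash.
    $regex = '~https?://(?!\[GoTo\]|\[NextURL\]|[^/" ]+/[it]/[^/" ]+/[^/" ]+)[^" ]+~i';

    preg_match_all($regex, $html, $matches);

    return $matches[0];
}

print_r(extractUrls($string));
```

With the sample text this should print the three URLs listed in the question. Unlike the broader pattern, a URL such as http://domaine.com/t/only is kept here, because it has only one segment after /t/.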
Upvotes: 2