GENE

Reputation: 193

extract specific URLs from text

I want to extract URLs from this text:

<body>
<a href="http://domaine.com/t/text/text"> <img src="http://domaine.com/i/text/text"></a> <br>
<a href="http://domaine.com/text"></a> <br>
<a href="http://domaine.com"></a> <br>
<a href="http://domaine.com/text/text"></a> <br>
<a href="http://[GoTo]"></a> <br>
<a href="http://[NextURL]"></a> <br>
</body>

But I want to exclude URLs that match certain patterns from being extracted. Those patterns are:

http://***/i/***/***
http://***/t/***/***
http://[GoTo]
http://[NextURL]

which means I should get only these URLs as a result:

http://domaine.com/text
http://domaine.com
http://domaine.com/text/text

What I have done so far is use this regex:

$regex = '/https?\:\/\/[^\" ]+/i';
preg_match_all($regex, $string, $matches);
print_r($matches[0]);

But as you can see, this extracts all the URLs, and I don't know how to exclude some of them using my specific patterns.

Upvotes: 0

Views: 150

Answers (1)

Jeremy

Reputation: 521

What you are looking for is a negative lookahead:

$regex = '/https?:\/\/(?!\[GoTo\]|\[NextURL\]|[^\" ]*\/i\/[^\" ]+|[^\" ]*\/t\/[^\" ]*)[^\" ]+/i';

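The (?!...) group right after the scheme is a negative lookahead: the match fails at that position whenever the text following http:// matches one of the enclosed alternatives, so those URLs are skipped. This might need tweaking for specific corner cases (for example, as written it rejects a URL containing /i/ or /t/ anywhere after the scheme, not only as the first path segment), but with the problem as stated, this should get you what you need.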
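
For illustration, here is a minimal sketch of a stricter variant run against the sample input from the question. It assumes the excluded patterns should apply only when /i/ or /t/ is the first path segment and is followed by exactly two more segments; that is my reading of http://***/i/***/***, not something the question confirms. The ~ delimiter and the [it] character class are choices made for this sketch only.

```php
<?php
// Sample input copied from the question (nowdoc, so nothing is interpolated).
$string = <<<'HTML'
<body>
<a href="http://domaine.com/t/text/text"> <img src="http://domaine.com/i/text/text"></a> <br>
<a href="http://domaine.com/text"></a> <br>
<a href="http://domaine.com"></a> <br>
<a href="http://domaine.com/text/text"></a> <br>
<a href="http://[GoTo]"></a> <br>
<a href="http://[NextURL]"></a> <br>
</body>
HTML;

// Assumed stricter pattern (not the answer's original):
//   ~ ... ~i      "~" as delimiter so "/" needs no escaping; case-insensitive as before
//   (?! ... )     negative lookahead evaluated right after "http://" or "https://"
//   \[GoTo\]      rejects the literal placeholder [GoTo]
//   \[NextURL\]   rejects the literal placeholder [NextURL]
//   [^/" ]+/[it]/[^/" ]+/[^/" ]+
//                 rejects host + "/i/" or "/t/" + two further path segments
$regex = '~https?://(?!\[GoTo\]|\[NextURL\]|[^/" ]+/[it]/[^/" ]+/[^/" ]+)[^" ]+~i';

preg_match_all($regex, $string, $matches);
print_r($matches[0]);
```

With the sample input this prints the three expected URLs: http://domaine.com/text, http://domaine.com and http://domaine.com/text/text. For arbitrary HTML, an HTML parser followed by a filter on each href would likely be more robust than a single regex.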

Upvotes: 2
