Reputation: 15089
I have this piece of text, and I want to extract the links from it. Some links will be inside <a> tags and some will appear as plain text. The text also contains images, and I don't want their links.
How would I extract the links from this text while ignoring image links? Basically, both a link inside an <a> tag and a plain-text link such as google.com should be extracted.
string(441) "<p class="fr-tag">Please visit https://www.google.co.uk/?gfe_rd=cr&ei=9P2DVaW2BMWo8wfK74HYCg and this <a href="https://www.google.co.uk/?gfe_rd=cr&ei=9P2DVaW2BMWo8wfK74HYCg" rel="nofollow">link</a> should be filtered and this http://d.pr/i/1i2Xu <img class="fr-fin fr-tag" alt="Image title" src="https://cft-forum.s3-us-west-2.amazonaws.com/uploads%2F1434714755338-Screen+Shot+2015-06-19+at+12.52.28.png" width="300"></p>"
I have tried the following, but it's incomplete:
$dom = new DOMDocument();
$dom->loadHTML($html);
$hrefs = [];
// Collect the href of every <a> element; plain-text URLs are not found this way.
$tags = $dom->getElementsByTagName('a');
foreach ($tags as $tag) {
    $hrefs[] = $tag->getAttribute('href');
}
Upvotes: 0
Views: 204
Reputation: 822
I played around with this a lot more and have an answer that may better suit what you are trying to do, with a bit of "future proofing":
$str = '<p class="fr-tag">Please visit www.google.co.uk/?gfe_rd=cr&ei=9P2DVaW2BMWo8wfK74HYCg and this <a href="https://www.google.co.uk/?gfe_rd=cr&ei=9P2DVaW2BMWo8wfK74HYCg" rel="nofollow">link</a> should be filtered and this http://d.pr/i/1i2Xu <img class="fr-fin fr-tag" alt="Image title" src="https://cft-forum.s3-us-west-2.amazonaws.com/uploads%2F1434714755338-Screen+Shot+2015-06-19+at+12.52.28.png" width="300"></p>';
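// Turn non-breaking-space entities into ordinary spaces so explode() can split on them.
$str = str_replace('&nbsp;', ' ', $str);
$strArr = explode(' ', $str);
$len = count($strArr);
$matches = [];
for ($i = 0; $i < $len; $i++) {
    // Keep any piece that contains http or www (case-insensitive).
    if (stristr($strArr[$i], 'http') || stristr($strArr[$i], 'www')) {
        $matches[] = $strArr[$i];
    }
}
echo "<pre>";
print_r($matches);
echo "</pre>";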
I went back and analyzed your string and noticed that if you translate the non-breaking spaces (&nbsp;) to regular spaces, you can then explode the string into an array and step through it. If an element contains http or www, add it to the $matches array to be processed later. The output is pretty clean and easy to work with, and you also get rid of most of the HTML markup this way.
Something to note is that this probably isn't the best way to do this. I haven't tested it with any string other than the one you provided, so there is likely room for optimization.
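One possible refinement of the tokenizing idea, as a rough sketch only: run strip_tags() first so the img tag (and its src) is gone before splitting on whitespace. The function name plainTextUrlTokens and the example.com image URL are made up for illustration, the sample string is shortened, and this covers plain-text URLs only; href values inside <a> tags would still need separate handling, for example the DOMDocument loop from the question.

```php
<?php
// Hypothetical helper: collect plain-text URL tokens only (href values are not covered here).
function plainTextUrlTokens(string $html): array
{
    // Remove every tag (including <img> and its src), keeping only the visible text.
    $text = strip_tags($html);
    // Treat &nbsp; entities as ordinary spaces, as in the answer above.
    $text = str_replace('&nbsp;', ' ', $text);

    $matches = [];
    foreach (preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY) as $token) {
        // Keep tokens that start with http or www (stricter than "contains").
        if (stripos($token, 'http') === 0 || stripos($token, 'www.') === 0) {
            $matches[] = $token;
        }
    }
    return $matches;
}

// Shortened sample; the image URL is a placeholder, not the one from the question.
$str = '<p>Please visit www.google.co.uk/?gfe_rd=cr and this <a href="https://www.google.co.uk/" rel="nofollow">link</a> and this http://d.pr/i/1i2Xu <img alt="Image title" src="https://example.com/shot.png" width="300"></p>';
$tokens = plainTextUrlTokens($str);
print_r($tokens);
```

Checking the start of each token rather than "contains" keeps ordinary words that merely include "www" or "http" out of the result.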
Upvotes: -1
Reputation: 822
Using just that one string to test, the following works for me:
$str = '<p class="fr-tag">Please visit https://www.google.co.uk/?gfe_rd=cr&ei=9P2DVaW2BMWo8wfK74HYCg and this <a href="https://www.google.co.uk/?gfe_rd=cr&ei=9P2DVaW2BMWo8wfK74HYCg" rel="nofollow">link</a> should be filtered and this http://d.pr/i/1i2Xu <img class="fr-fin fr-tag" alt="Image title" src="https://cft-forum.s3-us-west-2.amazonaws.com/uploads%2F1434714755338-Screen+Shot+2015-06-19+at+12.52.28.png" width="300"></p>';
preg_match('~a href="(.*?)"~', $str, $strArr);
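Using a href="..." as the pattern, preg_match() fills the array $strArr with two values: the full match and the captured URL. Note that preg_match() stops at the first match, so only the first <a> link in the string is returned.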
Array
(
    [0] => a href="https://www.google.co.uk/?gfe_rd=cr&ei=9P2DVaW2BMWo8wfK74HYCg"
    [1] => https://www.google.co.uk/?gfe_rd=cr&ei=9P2DVaW2BMWo8wfK74HYCg
)
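If the text can contain more than one <a> tag, preg_match_all() may be a better fit, since it keeps going after the first match. A minimal sketch, assuming double-quoted href attributes; the sample string is shortened, and its second <a> tag and the example.com image URL are added purely for illustration:

```php
<?php
// Shortened sample; the second <a> tag and the image URL are assumptions for illustration.
$str = '<p>Visit <a href="https://www.google.co.uk/?gfe_rd=cr" rel="nofollow">link</a> or '
     . '<a class="x" href="http://d.pr/i/1i2Xu">this</a> <img src="https://example.com/pic.png"></p>';

// Match href only inside <a ...> tags, so an <img src="..."> is never captured.
preg_match_all('~<a\b[^>]*\bhref="([^"]*)"~i', $str, $m);

// Capture group 1 holds the URLs of every matched <a> tag.
$hrefs = $m[1];
print_r($hrefs);
```

This still ignores plain-text URLs and single-quoted attributes, so it only solves the <a> half of the question.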
Upvotes: 1
Reputation: 722
I would try something like this.
Find and remove image tags:
$content = preg_replace("/<img[^>]+\>/i", "(image) ", $content);
Find and collect URLs:
preg_match_all('#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#', $content, $match);
Output the URLs:
print_r($match);
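Put together, the steps above could look roughly like this. It is only a sketch: the function name extractNonImageUrls is made up, the two patterns are the ones from this answer, and the de-duplication at the end is an extra assumption about what is wanted.

```php
<?php
// Hypothetical helper name; the two patterns are the ones shown in this answer.
function extractNonImageUrls(string $content): array
{
    // Step 1: replace every <img ...> tag so its src URL can no longer match.
    $content = preg_replace("/<img[^>]+\>/i", "(image) ", $content);

    // Step 2: collect http/https URLs from what is left (tag attributes and plain text).
    preg_match_all('#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#', $content, $match);

    // $match[0] holds the full matches; drop duplicates and reindex (an added assumption).
    return array_values(array_unique($match[0]));
}

$html = '<p class="fr-tag">Please visit https://www.google.co.uk/?gfe_rd=cr&ei=9P2DVaW2BMWo8wfK74HYCg and this <a href="https://www.google.co.uk/?gfe_rd=cr&ei=9P2DVaW2BMWo8wfK74HYCg" rel="nofollow">link</a> should be filtered and this http://d.pr/i/1i2Xu <img class="fr-fin fr-tag" alt="Image title" src="https://cft-forum.s3-us-west-2.amazonaws.com/uploads%2F1434714755338-Screen+Shot+2015-06-19+at+12.52.28.png" width="300"></p>';
$urls = extractNonImageUrls($html);
print_r($urls);
```

Because the URL pattern requires http:// or https://, a bare www.google.co.uk without a scheme would not be picked up by this version.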
Good luck!
Upvotes: 1