Reputation: 970

Regex to remove external links except links to a given domain (PHP)

I want a regex that removes all external links from my content and keeps only the links to a given domain.

For example:

$inputContent = 'Lorem Ipsum <a href="http://www.example1.com" target="_blank">http://www.example1.com</a> lorem ipsum dummy text <a href="http://www.mywebsite.com" target="_blank">http://www.mywebsite.com</a>';

Expected output:

$outputContent = 'Lorem Ipsum lorem ipsum dummy text <a href="http://www.mywebsite.com" target="_blank">http://www.mywebsite.com</a>';

I tried this solution, but it's not working:

$pattern = '#<a [^>]*\bhref=([\'"])http.?://((?<!mywebsite)[^\'"])+\1 *>.*?</a>#i';
$filteredString = preg_replace($pattern, '', $inputContent);

Upvotes: 2

Answers (3)

Armali

Reputation: 19375

Tried with this solution but it's not working.
$pattern = '#<a [^>]*\bhref=([\'"])http.?://((?<!mywebsite)[^\'"])+\1 *>.*?</a>#i';

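You were close. Your pattern expects > right after the href value (\1 *>), but in your links the target attribute comes first, so the pattern never matches. To make it work, remove just that one >, i.e.: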

  $pattern = '#<a [^>]*\bhref=([\'"])http.?://((?<!mywebsite)[^\'"])+\1 *.*?</a>#i';
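
To see the effect, here is a minimal sketch that runs the corrected pattern against the input from the question. The variable names follow the question; nothing else is assumed. Note that the lookbehind only guards against the literal text mywebsite inside the URL, so a URL such as http://evil.com/mywebsite.html would most likely be kept as well.

```php
<?php
// Input taken from the question.
$inputContent = 'Lorem Ipsum <a href="http://www.example1.com" target="_blank">http://www.example1.com</a> lorem ipsum dummy text <a href="http://www.mywebsite.com" target="_blank">http://www.mywebsite.com</a>';

// Pattern from this answer: the `>` after ` *` is gone, so `.*?` can
// consume the remaining attributes, the closing `>` and the link text.
$pattern = '#<a [^>]*\bhref=([\'"])http.?://((?<!mywebsite)[^\'"])+\1 *.*?</a>#i';

// Every matching anchor (tag and text) is replaced with an empty string.
$filteredString = preg_replace($pattern, '', $inputContent);

echo $filteredString, PHP_EOL;
```

The example1.com link disappears together with its text, while the mywebsite.com link is left in place.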

Upvotes: 0

Zdeněk

Reputation: 331

A regex-based solution:

$inputContent = 'Lorem Ipsum <a href=\'http://www.example1.com\' target="_blank"><strong>http://www.example1.com</strong></a> lorem ipsum dummy text <a href="http://www.mywebsite.com" target="_blank">http://www.mywebsite.com</a>';  

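function callback($matches) {
    // $matches[0]: whole anchor, $matches[1]: href value, $matches[2]: inner HTML
    //print_r($matches);

    // Keep links that point to mywebsite.com (with or without www, any path)
    if (preg_match('#^https?://(www\.)?mywebsite\.com(/.*)?$#i', $matches[1])) {
        return $matches[0]; // return the original anchor unchanged
    }

    //return ''; // remove the link together with its text
    return $matches[2]; // or drop only the anchor tag and keep its text
}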

$pattern = '#<a[^>]*href=[\'"]([^\'"]*)[\'"][^>]*>(((?!<a\s).)*)</a>#i';
$filteredString = preg_replace_callback($pattern, 'callback', $inputContent);

echo $filteredString;
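
If the domain should not be hard-coded, the same idea can be wrapped in a function. The following is only a sketch: stripExternalLinks() is a hypothetical helper name, and it assumes that only the bare domain and its www. variant count as internal.

```php
<?php
// Hypothetical helper (not part of the answer above): unwrap anchors whose
// href does not point at $domain, keeping their inner HTML.
function stripExternalLinks(string $html, string $domain): string
{
    // preg_quote() escapes the dots in the domain so they match literally.
    $allowed = '~^https?://(www\.)?' . preg_quote($domain, '~') . '([/?#].*)?$~i';
    $anchor  = '#<a[^>]*href=[\'"]([^\'"]*)[\'"][^>]*>(((?!<a\s).)*)</a>#i';

    return preg_replace_callback(
        $anchor,
        function (array $m) use ($allowed): string {
            // $m[0] is the whole anchor, $m[1] the href, $m[2] the inner HTML.
            return preg_match($allowed, $m[1]) ? $m[0] : $m[2];
        },
        $html
    ) ?? $html; // fall back to the input if the regex engine reports an error
}

$inputContent = 'Lorem Ipsum <a href=\'http://www.example1.com\' target="_blank"><strong>http://www.example1.com</strong></a> lorem ipsum dummy text <a href="http://www.mywebsite.com" target="_blank">http://www.mywebsite.com</a>';

echo stripExternalLinks($inputContent, 'mywebsite.com'), PHP_EOL;
```

Returning $m[0] for internal links leaves the original markup untouched, so attributes other than href and target survive as well.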

Upvotes: 0

revo

Reputation: 48711

Regular expressions are not really the right tool here. You are parsing HTML, so use a tool built for it: DOMDocument.

<?php

$html = <<< HTML
Lorem Ipsum <a href="http://www.example1.com" target="_blank">http://www.example1.com</a>
lorem ipsum dummy text
<a href="http://mywebsite.com" target="_blank">http://www.mywebsite.com</a>
HTML;


$dom = new \DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED  | LIBXML_HTML_NODEFDTD);
$xpath = new \DOMXPath($dom);

$site = 'mywebsite.com';
// Query all `a` elements whose href doesn't start with your website's URL (only http:// is checked)
$anchors = $xpath->query("//a[not(starts-with(@href,'http://{$site}')) and not(starts-with(@href,'http://www.{$site}'))]");

foreach ($anchors as $anchor) {
    $anchor->parentNode->removeChild($anchor);
}

echo $dom->saveHTML();

Output:

<p>Lorem Ipsum 
lorem ipsum dummy text
<a href="http://mywebsite.com" target="_blank">http://www.mywebsite.com</a></p>
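
The XPath above compares URL prefixes, so https:// links to your own site would be removed too. One possible variant, sketched here under the assumption that subdomains of the site also count as internal and that relative links should be kept, compares the host returned by parse_url() instead. The sample markup and the $isInternal variable are made up for illustration.

```php
<?php
$html = '<p>Lorem Ipsum <a href="http://www.example1.com">example</a> lorem ipsum '
      . '<a href="https://www.mywebsite.com/page">mine</a></p>';

$site = 'mywebsite.com'; // assumed domain to keep

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Copy the live node list into an array first, so removing nodes
// does not disturb the iteration.
$anchors = iterator_to_array($dom->getElementsByTagName('a'));

foreach ($anchors as $anchor) {
    // parse_url() yields no string host for relative or malformed hrefs.
    $host = parse_url($anchor->getAttribute('href'), PHP_URL_HOST);
    if (!is_string($host)) {
        continue; // treated as internal here (an assumption)
    }
    $host = strtolower($host);
    // Internal means: the site itself or any of its subdomains.
    $isInternal = $host === $site || substr($host, -strlen(".$site")) === ".$site";
    if (!$isInternal) {
        $anchor->parentNode->removeChild($anchor);
    }
}

$result = $dom->saveHTML();
echo $result;
```

Comparing hosts rather than string prefixes also avoids keeping look-alike URLs such as http://mywebsite.com.example.org.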

Upvotes: 2
