Replace all <link> tags containing given href attribute with Regex or DOM

Question

I'm struggling with this. The idea is to replace all <link> tags containing a specific href attribute inside a given string (which comes from a buffer and is regular HTML, though sometimes malformed).

I've tried the PHP DOM approach and also the SimpleHTMLDOM parser library, but so far nothing works for me (the problem is that the DOM approach returns only links inside element, but not those in section of the page), so I decided to use regex. Here is the non-working PHP DOM approach code:

function remove_css_links($string = "", $css_files = array()) {
    $css_files = array("http://www.example.com/css/css.css?ver=2.70", "style.css?ver=3.8.1");
    $xml = new DOMDocument();
    $xml->loadHTML($string);
    $link_list = $xml->getElementsByTagName('link');
    $link_list_length = $link_list->length;
    // The cycle
    for ($i = 0; $i < $link_list_length; $i++) {
        $attributes = $link_list->item($i)->attributes;
        $href = $attributes->getNamedItem('href');
        if (in_array($href->value, $css_files)) {
            // Remove the HTML node
        }
    }
    $string = $xml->saveHTML();
    return $string;
}

Here is the regex code. I know that most of you do not recommend regex for parsing HTML, but let's not discuss that here and now:

$html_text = '


    
    



...some content...


';
$url = preg_quote("http://www.example.com/css/css.css?ver=2.70");
$pattern = "~<([^>]+) href=".$url."/?>~";
$link = preg_replace($pattern, "", $html_text);

The problem with the regex is that the href attribute can be anywhere inside the tag, and the pattern I use can match any type of <link> tag. As you can see, I do not want to remove the shortcut icon or alternate types, nor anything whose href differs from the given URL. Note that the tags use different types of quotes, single and/or double.

However, I'm open to suggestions, and if it is possible to make the DOM approach work rather than use regex, that's OK.

Dr.Kameleon · Accepted Answer

OK, so here you are:

<?php

$html_text = '

    
    



...some content...


';

$d = new DOMDocument();
@$d->loadHTML($html_text);
$xpath = new DOMXPath($d);
$result = $xpath->query("//link");

foreach ($result as $link)
{
    $href = $link->getAttribute("href");

    if ($href == "whatyouwanttofilter")
    {
        $link->parentNode->removeChild($link);
    }
}

$output = $d->saveHTML();
echo $output;

?>

Tested and working. Have fun! :-)

The general idea is:

Load your HTML into a DOMDocument
Look for link nodes, using XPath
Loop through the nodes
Depending on the node's href attribute, delete the node (actually, remove the child from its... parent - well, yep, that's the PHP way... lol)
After doing all the cleaning-up, re-save the HTML and get it back into a string
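
The steps above could be wrapped into the function the question was aiming for. This is only a sketch under a few assumptions: the name remove_css_links() and the two stylesheet URLs are taken from the question, while the sample markup (including favicon.ico) is made up for illustration, and libxml_use_internal_errors() is swapped in for the @ operator as one possible way to keep warnings about malformed HTML quiet. Since the DOM parser normalizes the markup, the attribute order and the quote style should not matter for the comparison.

```php
<?php
// Hypothetical helper: name and parameters follow the question's
// remove_css_links(); it is not part of any library.
function remove_css_links(string $html, array $css_files): string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of silencing them with @,
    // because the input is expected to be malformed now and then.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Copy the matching nodes into a plain array before removing any of them.
    $links = iterator_to_array($xpath->query('//link[@href]'));

    foreach ($links as $link) {
        // Exact string comparison against the list of unwanted hrefs.
        if (in_array($link->getAttribute('href'), $css_files, true)) {
            $link->parentNode->removeChild($link);
        }
    }

    return $doc->saveHTML();
}

// Made-up sample input: mixed quote styles and attribute order.
$html = '<html><head>'
    . '<link rel="stylesheet" type="text/css" href="http://www.example.com/css/css.css?ver=2.70">'
    . "<link href='style.css?ver=3.8.1' rel='stylesheet'>"
    . '<link rel="shortcut icon" href="favicon.ico">'
    . '</head><body>...some content...</body></html>';

$clean = remove_css_links($html, array(
    'http://www.example.com/css/css.css?ver=2.70',
    'style.css?ver=3.8.1',
));

echo $clean;
```

With this input, both stylesheet links should be gone from $clean while the shortcut icon link stays. Note that saveHTML() returns the whole document as the parser rebuilt it, so the result will not be byte-for-byte identical to the input.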
