Reputation: 31919
I'm struggling with this. The idea is to remove all <link> tags that have a specific href attribute from a given string (it comes from a buffer and is regular HTML, though sometimes malformed).
I've tried the PHP DOM approach and also the SimpleHTMLDOM parser library, but so far nothing works for me: the DOM approach returns only the links inside the <body> element, not those in the <head> section of the page. So I decided to use regex.
Here is the non-working PHP DOM approach code:
function remove_css_links($string = "", $css_files = array()) {
    $css_files = array("http://www.example.com/css/css.css?ver=2.70","style.css?ver=3.8.1");
    $xml = new DOMDocument();
    $xml->loadHTML($string);
    $link_list = $xml->getElementsByTagName('link');
    $link_list_length = $link_list->length;
    //The cycle
    for ($i = 0; $i < $link_list_length; $i++) {
        $attributes = $link_list->item($i)->attributes;
        $href = $attributes->getNamedItem('href');
        if (in_array($href->value, $css_files)) {
            //Remove the HTML node
        }
    }
    $string = $xml->saveHTML();
    return $string;
}
Here is the regex code. I know that parsing HTML with regex is generally not recommended, but let's not discuss that here and now:
$html_text = '
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="shortcut icon" href="http://www.example.com/favicon.ico" />
<link rel="alternate" type="application/rss+xml" title="Website » Feed" href="/feed/" />
<link rel=\'stylesheet\' href=\'http://www.example.com/css/css.css?ver=2.70\' type=\'text/css\' media=\'all\' /></head>
<body>...some content...
<link rel=\'stylesheet\' id=\'css\' href=\'style.css?ver=3.8.1\' type=\'text/css\' media=\'all\' />
</body></html>
';
$url = preg_quote("http://www.example.com/css/css.css?ver=2.70");
$pattern = "~<link([^>]+) href=".$url."/?>~";
$link = preg_replace($pattern, "", $html_text);
The problem with the regex is that the href attribute can appear anywhere inside the <link> tag, and the pattern I use can match any kind of <link> tag. As you can see, I do not want to remove the shortcut icon or alternate links, or any link whose href attribute differs from the given URL. Note also that the <link> tags use different types of quotes, single and/or double.
However, I'm open to suggestions, and if it is possible to make the DOM approach work instead of using regex, that's OK too.
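To make the requirement concrete, here is a minimal sketch of a pattern that accepts the href attribute in any position and with either quote style. It is only an illustration under stated assumptions: the sample markup is a shortened, hypothetical version of the HTML above, and the pattern assumes the attribute value is always quoted and that no `>` character appears inside an attribute value.

```php
<?php
// Shortened, hypothetical sample with mixed quote styles.
$html_text = '<head>'
    . '<link rel="shortcut icon" href="http://www.example.com/favicon.ico" />'
    . "<link rel='stylesheet' href='http://www.example.com/css/css.css?ver=2.70' media='all' />"
    . '</head><body>'
    . "<link rel='stylesheet' id='css' href='style.css?ver=3.8.1' />"
    . '</body>';

// Escape the URL for use in a pattern delimited by "~".
$url = preg_quote('http://www.example.com/css/css.css?ver=2.70', '~');

// [^>]*?   any attributes before href
// \shref   whitespace before href, so "data-href" is not matched
// (["\'])  opening quote, captured so that \1 demands the same closing quote
// [^>]*>   any attributes after href, up to the end of the tag
$pattern = '~<link\b[^>]*?\shref\s*=\s*(["\'])' . $url . '\1[^>]*>~i';

$result = preg_replace($pattern, '', $html_text);
echo $result;
```

Only the tag whose href equals the given URL exactly should be removed; the shortcut icon link and the other stylesheet link are left alone.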
Upvotes: 2
Views: 2073
Reputation: 22820
OK, so here you are:
<?php
$html_text = '
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="shortcut icon" href="http://www.example.com/favicon.ico" />
<link rel="alternate" type="application/rss+xml" title="Website » Feed" href="/feed/" />
<link rel="stylesheet" href="http://www.example.com/css/css.css?ver=2.70" type="text/css" media="all" /></head>
<body>...some content...
<link rel="stylesheet" id="css" href="style.css?ver=3.8.1" type="text/css" media="all" />
</body></html>
';
$d = new DOMDocument();
@$d->loadHTML($html_text);
$xpath = new DOMXPath($d);
$result = $xpath->query("//link");
foreach ($result as $link)
{
    $href = $link->getAttribute("href");
    if ($href == "whatyouwanttofilter")
    {
        $link->parentNode->removeChild($link);
    }
}
$output = $d->saveHTML();
echo $output;
?>
Tested and working. Have fun! :-)
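If you want this wrapped in a function like the one in the question, a minimal sketch could look like the following. The signature and the parameter name `$hrefs_to_remove` are my own assumptions, not something from the question, and I use `libxml_use_internal_errors()` instead of `@` to keep parse warnings quiet on malformed markup. The sample HTML is a shortened, hypothetical version of the original.

```php
<?php
// Hypothetical helper: removes every <link> whose href is listed in $hrefs_to_remove.
function remove_css_links(string $html, array $hrefs_to_remove): string
{
    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Only <link> elements that actually carry an href attribute.
    foreach ($xpath->query('//link[@href]') as $link) {
        // Exact string comparison against the list of unwanted hrefs.
        if (in_array($link->getAttribute('href'), $hrefs_to_remove, true)) {
            $link->parentNode->removeChild($link);
        }
    }

    return $doc->saveHTML();
}

// Shortened, hypothetical sample with links in both <head> and <body>.
$html = '<!DOCTYPE html><html><head>'
    . '<link rel="shortcut icon" href="http://www.example.com/favicon.ico" />'
    . "<link rel='stylesheet' href='http://www.example.com/css/css.css?ver=2.70' />"
    . '</head><body><p>some content</p>'
    . "<link rel='stylesheet' id='css' href='style.css?ver=3.8.1' />"
    . '</body></html>';

$clean = remove_css_links($html, array(
    'http://www.example.com/css/css.css?ver=2.70',
    'style.css?ver=3.8.1',
));
echo $clean;
```

The comparison is an exact match on the href value, so a different query string (say `?ver=2.71`) would not be removed. Adjust the check if you need something looser.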
The general idea is:
Load the HTML into a DOMDocument
Find all the link nodes, using XPath
Depending on the href attribute, delete the node (actually, remove the child from its... parent - well, yep, that's the PHP way... lol)
Upvotes: 2