Reputation: 2169
I found a way to remove all tag attributes from an HTML string using PHP:
$html_string = "<div class='myClass'><b>This</b> is an <span style='margin:20px'>example</span><img src='ima.jpg' /></div>";
$output = preg_replace("/<([a-z][a-z0-9]*)[^>]*?(\/?)>/i",'<$1$2>', $html_string);
echo $output;
//<div><b>This</b> is an <span>example</span><img/></div>
But I would like to keep certain attributes, such as src and href. I have almost no experience with regular expressions, so any help would be really appreciated.
[maybe] Relevant update: This is part of a process of 'cleaning' posts in a database. I am iterating through all the posts, getting the HTML, cleaning it, and updating it in the corresponding table.
Upvotes: 3
Views: 4894
Reputation: 23670
You usually should not parse HTML using regular expressions. Instead, in PHP you should call DOMDocument::loadHTML. You can then recurse through the elements in the document and call removeAttribute. Regular expressions for HTML tags are notoriously tricky.
REF: http://php.net/manual/en/domdocument.loadhtml.php
Examples: http://coursesweb.net/php-mysql/html-attributes-php
Here's a solution for you. It iterates over all attributes in the DOM (the XPath expression //@* selects every attribute node) and removes those that are not src or href.
$html_string = "<div class=\"myClass\"><b>This</b> is an <span style=\"margin:20px\">example</span><img src=\"ima.jpg\" /></div>";
$dom = new DOMDocument; // init new DOMDocument
$dom->loadHTML($html_string); // load the HTML
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//@*'); // select every attribute node
foreach ($nodes as $node) {
    if ($node->nodeName != "src" && $node->nodeName != "href") {
        $node->parentNode->removeAttribute($node->nodeName);
    }
}
echo $dom->saveHTML(); // output cleaned HTML
Here is another solution that uses the XPath expression itself to filter on attribute names, so the loop needs no condition:
$dom = new DOMDocument; // init new DOMDocument
$dom->loadHTML($html_string); // load the HTML
$xpath = new DOMXPath($dom);
// select only attributes whose name is neither src nor href
$nodes = $xpath->query("//@*[local-name() != 'src' and local-name() != 'href']");
foreach ($nodes as $node) {
    $node->parentNode->removeAttribute($node->nodeName);
}
echo $dom->saveHTML(); // output cleaned HTML
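Since the question mentions cleaning many posts, the loop above could be wrapped in a reusable function. This is only a sketch: the name strip_attributes and its $keep parameter are hypothetical and not part of the original answer. It also works around one thing the snippets above do not address: loadHTML() wraps a fragment in a doctype plus html and body elements, so the sketch serializes only the children of body.

```php
<?php
// Hypothetical helper (illustrative name, not from the original answer):
// removes every attribute whose name is not in $keep and returns the cleaned
// fragment without the <!DOCTYPE>, <html> and <body> wrappers.
function strip_attributes(string $html, array $keep = ['src', 'href']): string
{
    // loadHTML() rejects an empty string, so handle that case up front.
    if (trim($html) === '') {
        return '';
    }

    $dom = new DOMDocument;

    // Collect parser warnings about imperfect markup instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The XPath result is a snapshot, so removing attributes while looping is safe.
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//@*') as $attr) {
        if (!in_array($attr->nodeName, $keep, true)) {
            $attr->ownerElement->removeAttribute($attr->nodeName);
        }
    }

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize only what is inside <body>, node by node.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$html_string = '<div class="myClass"><b>This</b> is an <span style="margin:20px">example</span><img src="ima.jpg" /></div>';
echo strip_attributes($html_string);
```

Passing a different list, for example strip_attributes($html, ['src', 'href', 'alt']), would keep additional attributes. Note that the serializer writes HTML rather than XHTML, so a self-closing tag such as the img element is expected to come back without the trailing slash.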
Tip: If your HTML contains extended (non-ASCII) characters, make sure the DOM parser handles the input as UTF-8, like this:
$dom->loadHTML(mb_convert_encoding($html_string, 'HTML-ENTITIES', 'UTF-8'));
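One caveat: the 'HTML-ENTITIES' target of mb_convert_encoding is deprecated as of PHP 8.2. A commonly used alternative, sketched below, is to prepend a charset declaration so the parser itself decodes the input as UTF-8. The sample string is made up for illustration, and the sketch assumes the input really is UTF-8.

```php
<?php
// Illustrative input containing a non-ASCII character (U+00E9, "e" with acute).
$html_string = "<p class=\"intro\">caf\u{00E9}</p>";

$dom = new DOMDocument;

// The prepended <meta> tag declares the charset; the parser places it in an
// implied <head>, so it does not end up among the <body> children.
$dom->loadHTML(
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html_string
);

// DOM strings are always UTF-8, so textContent shows whether decoding worked.
$p = $dom->getElementsByTagName('p')->item(0);
echo $p->textContent;
```

If the text comes back with mangled characters instead of the original word, the parser fell back to a different default encoding and the declaration was not picked up.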
Upvotes: 7