Multitut

Reputation: 2169

Remove attributes from HTML tags using PHP while keeping specific attributes

I found a way to remove all tag attributes from an HTML string using PHP:

$html_string = "<div class='myClass'><b>This</b> is an <span style='margin:20px'>example</span><img src='ima.jpg' /></div>";
$output = preg_replace("/<([a-z][a-z0-9]*)[^>]*?(\/?)>/i",'<$1$2>', $html_string);
echo $output;
//<div><b>This</b> is an <span>example</span><img/></div>

But I would like to keep certain attributes, such as src and href. I have almost no experience with regular expressions, so any help would be really appreciated.

[Maybe] relevant update: this is part of a process of 'cleaning' posts in a database. I am iterating through all the posts, getting the HTML, cleaning it, and updating it in the corresponding table.

Upvotes: 3

Views: 4894

Answers (1)

Drakes

Reputation: 23670

As a rule, you should not parse HTML with regular expressions: patterns for HTML tags are notoriously tricky to get right. In PHP, load the markup with DOMDocument::loadHTML instead, then walk the elements in the document and call removeAttribute for each attribute you want to drop.

REF: http://php.net/manual/en/domdocument.loadhtml.php

Examples: http://coursesweb.net/php-mysql/html-attributes-php

Here's a solution for you. It will iterate over all tags in the DOM, and remove attributes which are not src or href.

$html_string = "<div class=\"myClass\"><b>This</b> is an <span style=\"margin:20px\">example</span><img src=\"ima.jpg\" /></div>";

$dom = new DOMDocument;                 // init new DOMDocument
$dom->loadHTML($html_string);           // load the HTML
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//@*');         // select every attribute node in the document
foreach ($nodes as $node) {
    // keep src and href, drop everything else
    if ($node->nodeName !== "src" && $node->nodeName !== "href") {
        $node->parentNode->removeAttribute($node->nodeName);
    }
}

echo $dom->saveHTML();                  // output cleaned HTML

Here is another solution that lets the XPath expression itself filter on attribute names:

$dom = new DOMDocument;                 // init new DOMDocument
$dom->loadHTML($html_string);           // load the HTML
$xpath = new DOMXPath($dom);
// select only the attributes whose name is neither src nor href
$nodes = $xpath->query("//@*[local-name() != 'src' and local-name() != 'href']");
foreach ($nodes as $node) {
    $node->parentNode->removeAttribute($node->nodeName);
}

echo $dom->saveHTML();                  // output cleaned HTML
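
One thing to be aware of: loadHTML wraps a fragment in html and body elements, and saveHTML() with no argument serializes the whole document, so both snippets print a DOCTYPE and wrapper tags around the cleaned markup. If only the fragment is wanted (for example, to write it back to the posts table), a minimal sketch could look like the following. The function name stripAttributesExcept and its $keep parameter are made up for illustration, and the libxml_use_internal_errors calls are an optional addition to silence parser warnings on imperfect markup.

```php
<?php
// Hypothetical helper (name and signature are illustrative, not part of the answer above).
// Removes every attribute whose name is not listed in $keep and returns only
// the cleaned fragment, without the DOCTYPE/html/body wrapper.
function stripAttributesExcept(string $html, array $keep = ['src', 'href']): string
{
    $dom = new DOMDocument;

    // Collect libxml warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Same idea as above: visit every attribute node, drop the unwanted ones.
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//@*') as $attr) {
        if (!in_array(strtolower($attr->nodeName), $keep, true)) {
            $attr->ownerElement->removeAttribute($attr->nodeName);
        }
    }

    // loadHTML placed the fragment inside <body>; serialize its children only.
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$html_string = "<div class=\"myClass\"><b>This</b> is an <span style=\"margin:20px\">example</span><img src=\"ima.jpg\" /></div>";
$out = stripAttributesExcept($html_string);
echo $out;
```

The result should contain the div, b, span and img elements with only the src attribute left. Exact whitespace in the serialized output can vary between libxml versions, so compare loosely if you test it.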

Tip: loadHTML does not assume UTF-8 by default, so if your markup contains extended (non-ASCII) characters, convert them to HTML entities before loading, like this:

$dom->loadHTML(mb_convert_encoding($html_string, 'HTML-ENTITIES', 'UTF-8'));
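
Note that on PHP 8.2 and later, passing 'HTML-ENTITIES' to mb_convert_encoding raises a deprecation notice. One possible replacement, sketched here under the assumption that the mbstring extension is available, is mb_encode_numericentity, which turns every non-ASCII character into a numeric entity and leaves the markup untouched. The sample string and the $encoded and $cleaned variable names are made up for this example.

```php
<?php
// Sample input invented for illustration; "\u{e9}" is the UTF-8 character é.
$html_string = "<p class=\"note\">Caf\u{e9} <a href=\"menu.html\" target=\"_blank\">menu</a></p>";

// Convert every code point from 0x80 upward into a numeric entity (é becomes &#233;).
// The conversion map is [first code point, last code point, offset, mask].
$encoded = mb_encode_numericentity($html_string, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

$dom = new DOMDocument;
$dom->loadHTML($encoded);               // the parser now only sees ASCII plus entities
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//@*[local-name() != 'src' and local-name() != 'href']");
foreach ($nodes as $node) {
    $node->parentNode->removeAttribute($node->nodeName);
}

$cleaned = $dom->saveHTML();
echo $cleaned;
```

The accented character may come back from saveHTML() as an entity rather than as a raw UTF-8 character; browsers render both the same way, and html_entity_decode can convert it back if raw UTF-8 is needed.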

Upvotes: 7
