Teiv

Reputation: 2635

Recursively loop through the DOM tree and remove unwanted tags?

$tags = array(
    "applet" => 1,  
    "script" => 1
);

$html = file_get_contents("test.html");
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$body = $xpath->query("//body")->item(0);

I want to loop through the body of the web page and remove every unwanted tag listed in the $tags array, but I can't find a way to do it. How can I do this?

Upvotes: 3

Views: 2759

Answers (2)

Dvir Berebi

Reputation: 1566

Have you considered HTML Purifier? Writing your own HTML sanitizer is just reinventing the wheel, and it isn't easy to get right.

Furthermore, a blacklist approach is a poor choice, because anything you forget to list gets through; see SO/why-use-a-whitelist-for-html-sanitizing

You may also be interested in reading how to configure allowed tags & attributes, or in trying the HTML Purifier demo.
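
As a rough illustration of the whitelist idea, here is a minimal sketch. It assumes HTML Purifier was installed through Composer (so `vendor/autoload.php` exists); the allowed-tag list and the sample input are only examples, not a recommendation.

```php
<?php
// Assumption: HTML Purifier was installed with Composer (ezyang/htmlpurifier).
require_once 'vendor/autoload.php';

$config = HTMLPurifier_Config::createDefault();
// Whitelist: only these elements and attributes are kept (illustrative list).
$config->set('HTML.Allowed', 'p,b,i,a[href],ul,ol,li');

$purifier = new HTMLPurifier($config);

// Sample input; in the question this would come from file_get_contents("test.html").
$dirty = '<p>Hi <b>there</b><script>alert(1)</script></p>';
$clean = $purifier->purify($dirty);

echo $clean, "\n";
```

Because the configuration lists what is allowed rather than what is forbidden, tags such as applet or script never need to be named at all.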

Upvotes: 6

Epharion

Reputation: 1071

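$tags = array(
    "applet" => 1,
    "script" => 1
);

$html = file_get_contents("test.html");
$dom = new DOMDocument();
@$dom->loadHTML($html);   // @ suppresses warnings about malformed markup
$xpath = new DOMXPath($dom);

// The tag names are the array keys, so iterate over the keys, not numeric indexes.
foreach (array_keys($tags) as $tag) {
   // Copy the result into an array so removals cannot disturb the iteration.
   foreach (iterator_to_array($xpath->query("//" . $tag)) as $node) {
      if ($node->parentNode === null) continue;
      $node->parentNode->removeChild($node);
   }
}

// saveHTML() serializes the document as HTML; saveXML() would emit XML syntax.
$string = $dom->saveHTML();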

Something like that.
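
If you specifically want the recursive walk from the question title, a sketch along these lines may work. `removeTags` is a hypothetical helper name, and the inline `$html` string stands in for the contents of test.html; it only looks below the body element and still relies on the same blacklist.

```php
<?php
// Hypothetical helper: recursively removes blacklisted elements below $node.
function removeTags(DOMNode $node, array $tags)
{
    // childNodes is a live list, so copy it before removing anything.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($child->nodeType !== XML_ELEMENT_NODE) {
            continue; // skip text nodes, comments, etc.
        }
        if (isset($tags[strtolower($child->nodeName)])) {
            $node->removeChild($child); // drops the element and its subtree
        } else {
            removeTags($child, $tags);  // descend into allowed elements
        }
    }
}

$tags = array("applet" => 1, "script" => 1);

// Sample input standing in for file_get_contents("test.html").
$html = '<html><body><p>Keep <script>alert(1)</script>this</p>'
      . '<applet code="x"></applet></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$body = $dom->getElementsByTagName('body')->item(0);
removeTags($body, $tags);

$out = $dom->saveHTML($body); // serialize only the body subtree
echo $out, "\n";
```

Keeping the tag names as array keys, as in the question, makes the `isset()` lookup a constant-time check for each element visited.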

Upvotes: 4
