DOMDocument saving html with extra tags

Question

I am using DOMDocument to manipulate an HTML string rather than a complete web page. When I call saveHTML(), it automatically adds doctype and html tags.

$str = 'fragment containing html';
$str = utf8_encode($str);
$doc = new DOMDocument();
$doc->loadHTML($str);
// ...do stuff...
$str = $doc->saveHTML();

What is the correct way to save a fragment of HTML without the automatic inclusion of extra tags? Failing that, what is the correct method to remove these extra tags?

I used an HTML parser to avoid using regexes, so it seems a little counter-intuitive to have to use them on the output of a parser.

ThW · Accepted Answer

PHP's DOMDocument repairs the document when you load HTML. That means it adds the html and body elements.

So you need to fetch all nodes inside body and save them as HTML.

$html = <<<'HTML'
Hello World
Text

HTML;

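$dom = new DOMDocument();
// loadHtml() repairs the fragment by wrapping it in html and body elements
$dom->loadHtml($html);
$xpath = new DOMXPath($dom);

// Serialize each child node of body, leaving out the added wrappers
$result = '';
foreach ($xpath->evaluate('/html/body/node()') as $node) {
  $result .= $dom->saveHtml($node);
}

echo $result;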

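Here is another option, but it is not available everywhere yet. PHP 5.4 added the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD options, which also require a recent enough libxml (2.7.7 and 2.7.8 respectively). The first turns off the implied html and body elements, the second the default doctype.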

$dom->loadHtml($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
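
A minimal sketch of how these flags might be used end to end. The fragment and the wrapping div are assumptions made for illustration, not part of the original answer. Without the implied html and body elements the parser needs a single root element, so the fragment is wrapped in a container and only the container's children are saved.

```php
<?php
// Hypothetical fragment, used only for illustration.
$fragment = '<p>Hello <b>World</b></p><p>Text</p>';

$dom = new DOMDocument();
// Wrap the fragment in one container element (an assumption of this sketch)
// so the document has a single root; the flags suppress html/body and the doctype.
$dom->loadHTML(
    '<div>' . $fragment . '</div>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

// Save only the children of the container, not the container itself.
$result = '';
foreach ($dom->documentElement->childNodes as $node) {
    $result .= $dom->saveHTML($node);
}

echo $result;
```

If the fragment is already a single element, the wrapper should not be needed and $dom->saveHTML() can be called directly.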

PHP <= 5.3

The first and best option is to update PHP. PHP 5.3 is no longer maintained.

The second option is using DOMDocument::saveXML($node, LIBXML_NOEMPTYTAG). This will generate an XML (XHTML) fragment, but that should be enough for most cases.
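
As a rough sketch of the saveXML() route, assuming a small made-up fragment: the body children are serialized one by one as XML. LIBXML_NOEMPTYTAG keeps empty elements as an open/close pair instead of a self-closing tag. Note that it applies to every empty element, so void elements such as br would likely come out as an open/close pair too.

```php
<?php
// Hypothetical fragment, used only for illustration.
$html = '<p>Hello World</p><div></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// The parser has added html and body; collect only what is inside body.
$body = $dom->getElementsByTagName('body')->item(0);

$result = '';
foreach ($body->childNodes as $node) {
    // Serialize as XML, expanding empty elements to <tag></tag>.
    $result .= $dom->saveXML($node, LIBXML_NOEMPTYTAG);
}

echo $result;
```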

The last option would be to strip the extra tags from the saved output with the string functions.
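
A sketch of what the string-function approach could look like, assuming the saved output contains a plain body start tag without attributes (which is what the parser generates when the input has no body of its own). The fragment is made up for illustration.

```php
<?php
$dom = new DOMDocument();
// Hypothetical fragment, used only for illustration.
$dom->loadHTML('<p>Hello World</p>');

// Full output including doctype, html and body.
$full = $dom->saveHTML();

$open  = '<body>';
$start = strpos($full, $open);
$end   = strrpos($full, '</body>');

if ($start !== false && $end !== false) {
    // Keep only what lies between the body start and end tags.
    $start += strlen($open);
    $result = trim(substr($full, $start, $end - $start));
} else {
    // Fall back to the unmodified output if the markers are missing.
    $result = $full;
}

echo $result;
```

This is the most fragile of the options, since it depends on the exact shape of the serialized output.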
