You Old Fool

Reputation: 22941

Why do these two DOMDocument functions behave differently?

There are two approaches to getting the outer HTML of a DOMDocument node suggested here: How to return outer html of DOMDocument?

I'm interested in why they seem to treat HTML entities differently.

EXAMPLE:

// Approach 1: import the node into a fresh document and save the whole document
function outerHTML($node) {
    $doc = new DOMDocument();
    $doc->appendChild($doc->importNode($node, true));
    return $doc->saveHTML();
}

$html = '<p>ACME&rsquo;s 27&rdquo; Monitor is $200.</p>';
$dom = new DOMDocument();
@$dom->loadHTML($html);
$el = $dom->getElementsByTagName('p')->item(0);
// Approach 2: pass the node to saveHTML() on its owner document
echo $el->ownerDocument->saveHTML($el) . PHP_EOL;
echo outerHTML($el) . PHP_EOL;

OUTPUT:

<p>ACME’s 27” Monitor is $200.</p>
<p>ACME&rsquo;s 27&rdquo; Monitor is $200.</p>

Both methods use saveHTML(), but for some reason the function preserves HTML entities in the final output, while calling saveHTML() directly with a node argument does not. Can anyone explain why, preferably with some kind of authoritative reference?

Upvotes: 1

Views: 140

Answers (1)

miken32

Reputation: 42716

What this comes down to is even simpler than your test case above:

<?php
$html = '<p>ACME&rsquo;s 27&rdquo; Monitor is $200.</p>';
$dom = new DOMDocument();
// The flags stop libxml from adding <html>/<body> wrappers and a doctype
@$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

echo $dom->saveHTML($dom->documentElement) . PHP_EOL; // single node
echo $dom->saveHTML() . PHP_EOL;                      // whole document

So the question becomes: why does DOMDocument::saveHTML() behave differently when saving an entire document instead of just a specific node?

Taking a peek at the PHP source, we find a check for whether it's working with a single node or a whole document. For the former, the htmlNodeDumpFormatOutput function is called with the encoding explicitly set to null. For the latter, the htmlDocDumpMemoryFormat function is used, and that function does not take an encoding argument at all.

Both of these functions are from the libxml2 library. Looking at that source, we can see that htmlDocDumpMemoryFormat tries to detect the document encoding, and explicitly sets it to ASCII/HTML if it can't find one.

Both functions end up calling htmlNodeListDumpOutput, passing it the encoding that's been determined; either null – which results in no encoding – or ASCII/HTML – which encodes using HTML entities.
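One way to check that the two call styles differ only in how characters are written, and not in the text they carry, is to decode the entities in both results and compare them. This is only a sketch: normalizeMarkup() is a hypothetical helper, not part of DOMDocument, and because html_entity_decode() also decodes entities such as &lt; and &amp;, the normalized strings are meant for comparison only and should not be reused as markup.

```php
<?php
// Sketch: compare the two saveHTML() call styles after decoding entities.
libxml_use_internal_errors(true); // collect parser warnings instead of using @

$html = '<p>ACME&rsquo;s 27&rdquo; Monitor is $200.</p>';
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Hypothetical helper: decode HTML entities to UTF-8 and trim whitespace.
function normalizeMarkup(string $markup): string
{
    return trim(html_entity_decode($markup, ENT_QUOTES | ENT_HTML5, 'UTF-8'));
}

$nodeOutput = $dom->saveHTML($dom->documentElement); // single-node path
$docOutput  = $dom->saveHTML();                      // whole-document path

// Are the raw strings identical?
var_dump($nodeOutput === $docOutput);
// Are they identical once entities are decoded?
var_dump(normalizeMarkup($nodeOutput) === normalizeMarkup($docOutput));
```

If the explanation above holds, the second comparison should succeed even when the first does not, since the outputs then differ only in whether the quotes appear as entities or as literal characters.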

My guess is that, for a document fragment or single node, encoding is considered less important than for a full document.

Upvotes: 1
