Nikita240

Reputation: 1427

PHP - DOMDocument::saveHTML creates weird entities

I'm pulling XML from an API, and my goal is to save the XHTML in it as an HTML file for users to view.

The problem is that the saved HTML file gets some weird new entities that it shouldn't have. Here is an example.

This is what the pulled XHTML snippet looks like:

<p>    "At that point

And here is what the saved file looks like:

<p>&Acirc;&nbsp;&Acirc;&nbsp;&Acirc;&nbsp; "At that point

And this is what Chromium sees:

<p>Â&nbsp;Â&nbsp;Â&nbsp; "At that point

Between being pulled and being saved, the XHTML gets processed by a few different classes, so for simplicity's sake I have condensed all the objects the data passes through into the following:

//curl call is initialized here

$raw = curl_exec($ch);

$simplexml = simplexml_load_string($raw);

// xpath() returns an array of matches, so take the first one
$xmlstr = $simplexml->xpath($xpath)[0]->asXML();

$html = new DOMDocument;
$html->formatOutput = true;
$wrapper = $html->createElement("div");
$wrapper->setAttribute("id", "wrapper");
$wrapper = $html->appendChild($wrapper);

// loadHTML is an instance method; calling it statically fails on current PHP
$content = new DOMDocument;
$content->loadHTML($xmlstr, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
foreach ($content->firstChild->childNodes as $node) {
    $wrapper->appendChild($html->importNode($node, true));
}

$htmlstr = $html->saveHTML();


$html = new DOMDocument;
$html->formatOutput = true;

// second pass: the saved HTML string is parsed and serialized again
$content = new DOMDocument;
$content->loadHTML($htmlstr, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
foreach ($content->childNodes as $node) {
    $html->appendChild($html->importNode($node, true));
}

$html_str = $html->saveHTML();

file_put_contents($content_path, $html_str);

Yeah, it's a bit complex, but the data gets passed around a lot because it needs to have a lot of stuff added to it.

I just don't understand where those new entities come from. Any help would be appreciated.

Upvotes: 4

Views: 1733

Answers (2)

Nikita240

Reputation: 1427

I figured out what I was doing wrong.

I saved the output with simplexml like this:

$xmlstr = $simplexml->xpath($xpath)[0]->asXML();

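This formats the output as XML, which is read as UTF-8 unless an XML declaration says otherwise. Later, though, when I imported that output into the DOMDocument, I did it with loadHTML, which, as far as I can tell, assumes ISO-8859-1 when the markup declares no charset:

$content = new DOMDocument;
$content->loadHTML($xmlstr, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);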

I was able to resolve the issue by simply using loadXML instead of loadHTML:

$content = new DOMDocument;
// the LIBXML_HTML_* flags are specific to loadHTML, so they are dropped here
$content->loadXML($xmlstr);

Now my output is correct:

<p>&nbsp;&nbsp;&nbsp; "At that point
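The difference between the two parsers can be reproduced in isolation. The following is a minimal sketch, not the code from my project: the sample string is hypothetical and stands in for the asXML() output, and it assumes PHP 7 or later for the "\u{00A0}" escape. On libxml2 builds where the HTML parser falls back to ISO-8859-1, the first hex dump should show an extra c3 82 (the letter Â) in front of c2 a0, while the XML path keeps the single no-break space. Newer libxml2 versions may detect the encoding differently, so treat the HTML-side output as something to check rather than a guarantee.

```php
<?php
// Hypothetical sample: a UTF-8 fragment whose paragraph starts with U+00A0
// (a no-break space, bytes C2 A0), standing in for the asXML() output.
$xmlstr = "<div><p>\u{00A0}\"At that point</p></div>";

// HTML parser: the fragment declares no charset.
$viaHtml = new DOMDocument;
$viaHtml->loadHTML($xmlstr, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$htmlText = $viaHtml->getElementsByTagName('p')->item(0)->textContent;

// XML parser: no XML declaration, so the bytes are read as UTF-8.
$viaXml = new DOMDocument;
$viaXml->loadXML($xmlstr);
$xmlText = $viaXml->getElementsByTagName('p')->item(0)->textContent;

// Compare what each parser stored for the same input bytes.
echo bin2hex($htmlText), PHP_EOL;
echo bin2hex($xmlText), PHP_EOL;

// Compare how each document serializes.
echo $viaHtml->saveHTML();
echo $viaXml->saveHTML();
```

If the two hex dumps differ, the corruption happens at load time, before saveHTML is ever called.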

I'm still going to write a function to trim() these paragraphs, though. I don't know why they are provided with that leading whitespace.
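One thing to watch out for: PHP's trim() does not strip no-break spaces by default, so the leading U+00A0 characters survive a plain trim() call. Here is a rough sketch of such a helper. The function name trim_nbsp is made up for illustration, and it assumes the text is valid UTF-8 (it returns the input unchanged if the regex fails).

```php
<?php
// Hypothetical helper: strip leading and trailing whitespace,
// including U+00A0 (no-break space), from a UTF-8 string.
function trim_nbsp(string $text): string
{
    // The /u flag makes PCRE treat pattern and subject as UTF-8.
    $trimmed = preg_replace('/^[\s\x{00A0}]+|[\s\x{00A0}]+$/u', '', $text);

    // preg_replace returns null on error (for example invalid UTF-8).
    return $trimmed ?? $text;
}

echo trim_nbsp("\u{00A0}\u{00A0}\u{00A0} \"At that point"), PHP_EOL;
```

In a DOM tree this would be applied to the text node values, where the &nbsp; entities have already been decoded to U+00A0.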

Upvotes: 4

Kamil

Reputation: 1030

I think you are reading or writing the content with the wrong encoding. First, try dumping the loaded content to check whether it looks right.
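A quick way to do that check is to look at the raw bytes instead of the rendered text. This is only a sketch: the sample string is hypothetical and stands in for whatever string you loaded, and it assumes PHP 7 or later for the "\u{00A0}" escape.

```php
<?php
// Hypothetical sample standing in for the loaded string.
$xmlstr = "<p>\u{00A0}\"At that point</p>";

// preg_match with the /u flag only succeeds on valid UTF-8 input.
$isUtf8 = preg_match('//u', $xmlstr) === 1;

// Hex dump of the first 8 bytes; a UTF-8 no-break space appears as "c2a0".
$hex = bin2hex(substr($xmlstr, 0, 8));

var_dump($isUtf8);
echo $hex, PHP_EOL;
```

If the dump shows c2a0 where the spaces should be, the input is fine UTF-8 and the problem lies in how it is parsed or written afterwards.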

Upvotes: 0
