anonymous

Reputation: 121

Disable html entity encoding in PHP DOMDocument

I cannot figure out how to stop DOMDocument from mangling these characters.

<?php

$doc = new DOMDocument();
$doc->substituteEntities = false;
$doc->loadHTML('<p>¯\(°_o)/¯</p>');
print_r($doc->saveHTML());

?>

Expected Output: ¯(°_o)/¯

Actual Output: ¯(°_o)/¯

http://codepad.org/W83eHSsT

Upvotes: 12

Views: 6667

Answers (3)

migli

Reputation: 3252

PHP's DOMDocument will not convert characters to HTML entities if the HTML is loaded as UTF-8 and contains a meta charset=utf-8 tag.

The idea is to:

  • Detect the encoding of the HTML source and convert it to UTF-8
  • Load it into a DOMDocument created with the UTF-8 charset
  • Add the meta charset=utf-8 tag to the DOMDocument
  • Make whatever changes you need
  • Remove the meta charset=utf-8 tag from the saved result.

Here's some sample code:

<?php
$htmlContent = file_get_contents('source.html');

// Detect the source encoding in strict mode; fall back to UTF-8 if detection fails
$sourceEncoding = mb_detect_encoding($htmlContent, mb_detect_order(), true) ?: 'UTF-8';
$convertedContent = mb_convert_encoding($htmlContent, 'UTF-8', $sourceEncoding);

$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($convertedContent);

// Create the meta tag element
$metaTag = $dom->createElement('meta');
$metaTag->setAttribute('http-equiv', 'Content-Type');
$metaTag->setAttribute('content', 'text/html; charset=utf-8');

// Append the meta charset tag to the head element, creating <head> if the source had none
$head = $dom->getElementsByTagName('head')->item(0);
if ($head === null) {
    $head = $dom->createElement('head');
    $dom->documentElement->insertBefore($head, $dom->documentElement->firstChild);
}
$head->appendChild($metaTag);

// Do any stuff here

// Serialize, then strip the meta charset tag from the output
$new_content = str_replace('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">', '', $dom->saveHTML());

// Save to a destination file
file_put_contents('dest.html', $new_content);
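
A variant worth considering: as far as I can tell, the parser needs to know the charset while it is reading the input, so the meta tag can also be placed in front of the markup before calling loadHTML() instead of being appended afterwards. This is only a minimal sketch under that assumption; the inline $html string stands in for the converted file contents, and the exact serialized output may vary between libxml versions.

```php
<?php
// Assumed to be UTF-8 already (stand-in for the converted file contents)
$html = '<p>¯\(°_o)/¯</p>';

// Charset declaration placed before the markup so the parser sees it first
$meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

$dom = new DOMDocument();
$dom->loadHTML($meta . $html);

// Do any stuff here

// Serialize while the meta tag is still in the document,
// then strip it from the resulting string
$out = str_replace($meta, '', $dom->saveHTML());

echo $out;
```

The tag is removed from the string rather than from the DOM because saveHTML() relies on it being present at the moment of serialization.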

Upvotes: 0

feeela

Reputation: 29932

I found a hint in the comments on the DOMDocument::loadHTML documentation page:

(Comment from <mdmitry at gmail dot com> 21-Dec-2009 05:02: "You can also load HTML as UTF-8 using this simple hack:")

Just add '<?xml encoding="UTF-8">' before the HTML input:

$doc = new DOMDocument();
//$doc->substituteEntities = false;
$doc->loadHTML('<?xml encoding="UTF-8">' . '<p>¯\(°_o)/¯</p>');
print_r($doc->saveHTML());
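
The hack can be wrapped in a small helper that also removes the leftover prolog node, which the same documentation comment suggests doing. This is a rough sketch rather than a definitive implementation: the function name loadUtf8Html is made up for illustration, and depending on the libxml version saveHTML() may still write non-ASCII characters as entities such as &macr;, but they should be the correct characters rather than mangled ones.

```php
<?php
// Hypothetical helper; the name loadUtf8Html is not part of the DOM extension.
function loadUtf8Html(string $html): DOMDocument
{
    $doc = new DOMDocument();
    // The XML prolog tells the parser to read the input as UTF-8
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);

    // The prolog is kept as a processing-instruction node; remove it
    // so that it does not show up in the saved markup.
    foreach (iterator_to_array($doc->childNodes) as $node) {
        if ($node->nodeType === XML_PI_NODE) {
            $doc->removeChild($node);
        }
    }

    return $doc;
}

$doc = loadUtf8Html('<p>¯\(°_o)/¯</p>');
echo $doc->saveHTML();
```

The child nodes are copied with iterator_to_array() first so that removing a node does not disturb the list being iterated.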

Upvotes: 6

love2code94

Reputation: 71

<?xml version="1.0" encoding="utf-8">

at the top of the document takes care of this, for both saveXML and saveHTML.

Upvotes: 3
