Jakub

Reputation: 415

How to disable auto-entitying by DOMDocument

I'm parsing an HTML string with DOMDocument. I'm loading it this way:

$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="UTF-8"><div id="container">'.$text.'</div>', LIBXML_NOENT);

Then, I'm running some XPath queries and node replacing on it. (Even if I comment out these actions, the characters are still getting replaced.) Finally, I'm saving it this way:

$parsed = $dom->saveHTML();

But DOMDocument seems to replace non-ASCII characters with their entity representations. For example, this Czech string:

ěščřžýáíé

comes back as:

&#283;&scaron;&#269;&#345;&#382;&yacute;&aacute;&iacute;&eacute;

I can't use html_entity_decode(), because it breaks already highlighted and properly escaped source code.

How can I stop DOMDocument from automatically replacing non-ASCII characters with entities, so that the example above is returned unchanged as ěščřžýáíé?

Upvotes: 0

Views: 674

Answers (2)

Jakub

Reputation: 415

I finally found a solution, and it is simple: instead of <?xml encoding="UTF-8">, use <meta http-equiv="content-type" content="text/html;charset=utf-8">.
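A minimal sketch of that change, assuming a sample value for $text (hypothetical; the real input is not shown in the question). The meta tag declares the input encoding to the parser, and the idea is that saveHTML() then writes the characters out as UTF-8 rather than as entities:

```php
<?php
// Hypothetical sample input; the original $text is not shown in the question.
$text = '<p>ěščřžýáíé</p>';

$dom = new DOMDocument();
// Prepend the meta tag (instead of the <?xml encoding="UTF-8"> prefix) so the
// HTML parser knows the input is UTF-8.
$dom->loadHTML(
    '<meta http-equiv="content-type" content="text/html;charset=utf-8">'
    . '<div id="container">' . $text . '</div>'
);

// Serialize the whole document, as in the question.
$parsed = $dom->saveHTML();

echo $parsed, "\n";
```

Note that the serialized document still contains the helper meta tag, so it may need to be stripped if only the container markup is wanted.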

Upvotes: 1

Andrii

Reputation: 51

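$dom = new DOMDocument();
$text = '<div id="container">'.$text.'</div>';
// Convert non-ASCII characters to HTML entities before parsing
// (note: the 'HTML-ENTITIES' target is deprecated as of PHP 8.2)
$text = mb_convert_encoding($text, 'HTML-ENTITIES', 'UTF-8');
$dom->encoding = 'UTF-8';
$dom->loadHTML($text);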

Alternatively, have you tried a different loading method?

$dom->loadXML($text);

By default it uses UTF-8, but $text must be well-formed XHTML. If $text is not well-formed, try:

$dom->loadHTML('<meta charset="utf-8"/>'.$text);

If you are viewing the output in a browser, declare the charset there as well:

  echo '<meta charset="utf-8" />';
  echo  $parsed;
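The loadXML() route mentioned above could look roughly like this. It is a sketch under assumptions: the sample $text is hypothetical, it must be well-formed XML, and the explicit XML declaration is added here so the document carries a UTF-8 encoding when it is serialized:

```php
<?php
// Hypothetical sample input; it has to be well-formed XML for loadXML().
$text = '<p>ěščřžýáíé</p>';

$dom = new DOMDocument();
// The XML declaration states the encoding explicitly (an assumption of this
// sketch, not part of the original answer).
$dom->loadXML(
    '<?xml version="1.0" encoding="UTF-8"?>'
    . '<div id="container">' . $text . '</div>'
);

// Serialize only the root element, which omits the XML declaration.
$parsed = $dom->saveXML($dom->documentElement);

echo $parsed, "\n";
```

This only helps when the input is valid XHTML; ordinary HTML with unclosed tags or named entities such as &nbsp; will make loadXML() fail.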

Upvotes: 0
