Reputation: 415
I'm parsing an HTML string with DOMDocument. I'm loading it this way:
$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="UTF-8"><div id="container">'.$text.'</div>', LIBXML_NOENT);
Then I run some XPath queries and replace some nodes. (Even if I comment out these steps, the characters still get replaced.) Finally, I save it this way:
$parsed = $dom->saveHTML();
But DOMDocument seems to be replacing non-ASCII characters with their entity representations. For example, this string in Czech:
ěščřžýáíé
Returns:
&#283;&scaron;&#269;&#345;&#382;&yacute;&aacute;&iacute;&eacute;
I can't use html_entity_decode(), because it breaks already highlighted and properly escaped source code.
How can I stop DOMDocument from automatically replacing non-ASCII characters with entities (so that the above example also returns ěščřžýáíé)?
Upvotes: 0
Views: 674
Reputation: 415
Finally, I have a solution. It is simple: instead of <?xml encoding="UTF-8">, use <meta http-equiv="content-type" content="text/html;charset=utf-8">.
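For anyone who wants to try this, here is a minimal sketch of the fix applied to the code from the question. The sample value of $text is made up for illustration, and the LIBXML_NOENT option from the question is left out to keep the example short.

```php
<?php
// Hypothetical input; in the question this is the user-supplied $text.
$text = '<p>ěščřžýáíé</p>';

$dom = new DOMDocument();

// Declare the charset with a meta tag instead of the <?xml encoding="UTF-8"> prefix,
// so the document itself is marked as UTF-8.
$dom->loadHTML(
    '<meta http-equiv="content-type" content="text/html;charset=utf-8">'
    . '<div id="container">' . $text . '</div>'
);

$parsed = $dom->saveHTML();

// Check whether the characters survived and were not turned into entities.
var_dump(strpos($parsed, 'ěščřžýáíé') !== false);
var_dump(strpos($parsed, '&#283;') === false);
```

Note that the saved document then contains the meta tag (plus the doctype and html/body wrappers that loadHTML adds), so you may want to strip those if you only need the container.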
Upvotes: 1
Reputation: 51
$dom = new DOMDocument();
$text = '<div id="container">'.$text.'</div>';
$text = mb_convert_encoding($text, 'HTML-ENTITIES', "UTF-8");
$dom->encoding='UTF-8';
$dom->loadHTML($text);
OK, have you tried changing the method to:
$dom->loadXML();
By default it uses UTF-8, but $text must be well-formed XHTML. If $text is not well-formed, try:
$dom->loadHTML('<meta charset="utf-8"/>'.$text);
If you are viewing the output in a browser, try this:
echo '<meta charset="utf-8" />';
echo $parsed;
Upvotes: 0