Reputation: 490123
$dom = new DOMDocument('1.0', 'UTF-8');
$str = '<p>Hello®</p>';
var_dump(mb_detect_encoding($str));
$dom->loadHTML($str);
var_dump($dom->saveHTML());
Output:
string(5) "UTF-8"
string(158) "<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>Hello&Acirc;&reg;</p></body></html>
"
Why did my Unicode ® get converted to &Acirc;&reg; (Â®), and how do I stop this?
Am I going crazy today?
Upvotes: 6
Views: 1778
Reputation: 1099
I fixed this by decoding the UTF-8 before passing it to loadHTML:
$dom->loadHTML( utf8_decode( $html ) );
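A rough way to see why decoding first helps, sketched in Python rather than PHP. It assumes the HTML parser reads input with no declared encoding as Latin-1, so the two UTF-8 bytes of ® are taken as two characters while the single Latin-1 byte is read correctly. The helper names php_utf8_decode and parse_as_latin1 are hypothetical; the first only approximates utf8_decode(), which replaces characters that have no Latin-1 equivalent with "?".

```python
def php_utf8_decode(data: bytes) -> bytes:
    """Hypothetical stand-in for PHP's utf8_decode(): UTF-8 bytes -> Latin-1 bytes."""
    # Characters with no Latin-1 equivalent become "?", so this step is lossy.
    return data.decode("utf-8").encode("latin-1", errors="replace")


def parse_as_latin1(data: bytes) -> str:
    """Assumption: the parser decodes undeclared input as Latin-1."""
    return data.decode("latin-1")


utf8_bytes = "Hello®".encode("utf-8")  # b'Hello\xc2\xae'

misread = parse_as_latin1(utf8_bytes)                 # two characters for ®
fixed = parse_as_latin1(php_utf8_decode(utf8_bytes))  # one character for ®
lossy = parse_as_latin1(php_utf8_decode("€5".encode("utf-8")))

print(repr(misread))  # 'HelloÂ®'
print(repr(fixed))    # 'Hello®'
print(repr(lossy))    # '?5'
```

The last line is the catch with this approach: anything outside Latin-1, such as the euro sign, is lost before the parser ever sees it.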
saveHTML() seems to convert special characters such as German umlauts to their HTML entities, even though I set $dom->substituteEntities = false; ... o.O
This is quite strange, though, as the documentation states:
The DOM extension uses UTF-8 encoding.
(http://www.php.net/manual/de/class.domdocument.php, search for utf8)
Oh dear, encoding in PHP poses problems again and again... never ending story.
Upvotes: 4
Reputation: 51
You can add an XML encoding declaration (and take it out later). This works for me on setups other than stock CentOS 5.x (Ubuntu, cPanel's PHP):
<?php
$dom = new DOMDocument('1.0', 'UTF-8');
$str = '<p>Hello®</p>';
var_dump(mb_detect_encoding($str));
$dom->loadHTML('<?xml encoding="utf-8">'.$str);
var_dump($dom->saveHTML());
This is what you get:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<?xml encoding="utf-8"><html><body><p>Hello®</p></body></html>
Except on days when you get this:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<?xml encoding="utf-8"><html><body><p>Hello&Acirc;&reg;</p></body></html>
Upvotes: 5
Reputation: 798436
Your text editor says "®" in UTF-8, but the bytes in the file say "Â®" in Latin-1 (or a similar encoding), which is what PHP is using to read it. Using the character entity reference (&reg;) will remove this ambiguity.
>>> print(u'®'.encode('utf-8').decode('latin-1'))
Â®
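The same round trip can be pushed one step further to account for the length reported in the question. This is only a sketch in Python, not what PHP does internally: it assumes the serializer writes every non-ASCII character as a named HTML entity, and to_entities is a hypothetical helper.

```python
from html.entities import codepoint2name


def to_entities(text: str) -> str:
    """Hypothetical helper: write non-ASCII characters as HTML entities."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        elif ord(ch) in codepoint2name:
            out.append("&" + codepoint2name[ord(ch)] + ";")
        else:
            # No named entity available: fall back to a numeric reference.
            out.append("&#%d;" % ord(ch))
    return "".join(out)


# UTF-8 bytes read as Latin-1: one character becomes two.
misread = "®".encode("utf-8").decode("latin-1")
entities = to_entities(misread)
print(entities)

doctype = ('<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" '
           '"http://www.w3.org/TR/REC-html40/loose.dtd">')
document = doctype + "\n<html><body><p>Hello" + entities + "</p></body></html>\n"
print(len(document))
```

Under those assumptions the two misread characters become twelve bytes of entities, and the document length comes out at 158, which matches the string(158) in the question's var_dump.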
Upvotes: 2