Reputation: 2963
I'm trying to parse some UTF-8 encoded HTML text that contains left and right quotation marks (’). But when I get the HTML back out of the DOM with saveHTML(), the quotation marks always get mangled.
I've tried several things, including running utf8_encode() on the text before loading it into the DOM and passing ('1.0', 'UTF-8') to the DOMDocument constructor; neither worked.
I'm running out of ideas for how to sort this out. Converting the quotation marks into HTML entities isn't an option for me.
Here is a simplified example that breaks the quotation marks:
$a = "<html><body><div>won’t you, will you, won’t you, join the </div></body></html>";
$dom = new DOMDocument();
$dom->loadHTML($a);
$xpath = new DOMXPath($dom);
$tag = $xpath->query('//div');
foreach($tag as $t)
echo $dom->saveHTML($t);
The returned text looks like: will you, wonât you, will you, wonât you, join the
Upvotes: 1
Views: 195
Reputation: 1126
OK, if you insist on using loadHTML(), then try this:
First add an appropriate meta tag to your HTML, like this:
$a = "<html>
<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">
<body><div>won’t you, will you, won’t you, join the </div></body></html>";
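Then you can call loadHTML($a) and it will work. Without a declared charset, loadHTML() does not assume the input is UTF-8, so the meta tag is what tells the parser how to decode the bytes.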
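Put together, the meta tag approach could look like the sketch below. It reuses the markup and XPath query from the question; wrapping the tag in an explicit <head> and collecting the result in a variable named $out are my own choices for illustration, not part of the original snippet.

```php
<?php
// Sketch only: the source file itself must be saved as UTF-8.
// The meta tag declares the charset so the parser decodes the bytes as UTF-8.
$a = '<html><head>'
   . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
   . '</head><body>'
   . '<div>won’t you, will you, won’t you, join the </div>'
   . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($a);

$xpath = new DOMXPath($dom);

// $out is an illustrative name: it collects the serialized <div> elements.
$out = '';
foreach ($xpath->query('//div') as $t) {
    $out .= $dom->saveHTML($t);
}

echo $out, "\n";
```

Passing the node to saveHTML(), as the question already does, serializes just that element rather than the whole document.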
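Lastly, if you just cannot add the extra meta tag, you can try $dom->loadHTML(utf8_decode($a)); instead. This first converts your string from UTF-8 to Latin-1 (ISO-8859-1), which is then loaded into the DOM in the Latin-1 charset, and you will get output as Latin-1 as well. Be aware that this only helps for characters that exist in Latin-1: the curly quotation mark ’ (U+2019) does not, so utf8_decode() replaces it with "?". utf8_decode() is also deprecated as of PHP 8.2.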
Upvotes: 1
Reputation: 1126
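The solution seems to be to use $dom->loadXML($a) instead of loadHTML(). I tried it and it worked for me. Note that loadXML() requires well-formed XML, and it treats input without an encoding declaration as UTF-8.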
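For reference, a minimal sketch of the loadXML() route, using the markup from the question. Setting $dom->encoding, serializing with saveXML() instead of saveHTML(), and the $out variable are my assumptions for keeping the quotation marks as literal UTF-8 characters; the answer itself only states that loadXML() was swapped in.

```php
<?php
// Sketch only: this works because the question's markup happens to be
// well-formed XML. Real-world HTML often is not, and loadXML() will then fail.
$a = '<html><body><div>won’t you, will you, won’t you, join the </div></body></html>';

$dom = new DOMDocument();
$dom->loadXML($a);

// Assumption: declare the output encoding explicitly so non-ASCII characters
// are written as UTF-8 rather than as numeric character references.
$dom->encoding = 'UTF-8';

$xpath = new DOMXPath($dom);

// $out is an illustrative name: it collects the serialized <div> elements.
$out = '';
foreach ($xpath->query('//div') as $t) {
    $out .= $dom->saveXML($t);
}

echo $out, "\n";
```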
Upvotes: 1