DOM xpath breaks quotation marks

Question

I'm trying to parse some UTF-8 encoded HTML that contains left and right quotation marks such as ’. But when I get the HTML back from the DOM with saveHTML(), the quotation marks always get messed up.

I've tried several things, including running utf8_encode on the text before putting it into the DOM and passing ('1.0', 'UTF-8') to the constructor, but neither worked.

I'm running out of ideas for how to sort this out. Converting the quotation marks into HTML entities isn't an option for me.

Here is a simplified example that breaks the quotation marks:

$a = "<div>won’t you, will you, won’t you, join the </div>";
$dom = new DOMDocument();

$dom->loadHTML($a);

$xpath = new DOMXPath($dom);

$tag = $xpath->query('//div');

foreach($tag as $t)
    echo $dom->saveHTML($t);

The returned text looks like: will you, wonât you, will you, wonât you, join the

Dmitri Snytkine · Accepted Answer

Ok, if you insist on using loadHTML then try this:

Add an appropriate meta tag to your HTML first, like this:

$a = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<div>won’t you, will you, won’t you, join the </div>';

Then you can use loadHTML($a) and it will work.
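
Put together with the XPath query from the question, the meta tag approach might look like the sketch below. The <div> wrapper and the exact http-equiv form of the meta tag are assumptions (the markup in the post did not survive), the variable names are illustrative, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical sample markup; the <div> wrapper is assumed so that //div matches.
$body = '<div>won’t you, will you, won’t you, join the</div>';

// Declare the charset ahead of the content so the parser reads the bytes as UTF-8.
$meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

$dom = new DOMDocument();
$dom->loadHTML($meta . $body);

$xpath = new DOMXPath($dom);

// Collect the serialized markup of every matching <div>.
$output = '';
foreach ($xpath->query('//div') as $node) {
    $output .= $dom->saveHTML($node);
}

echo $output, PHP_EOL;
```

The meta tag only needs to appear before the text it should affect; the parser moves it into a generated head element, so it does not show up when a single node is serialized.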

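Lastly, if you just cannot add the extra meta tag, you can try $dom->loadHTML(utf8_decode($a)); instead. This first converts your string from UTF-8 to Latin-1, which is then loaded into the DOM in the Latin-1 charset, and you will get Latin-1 output as well. Keep in mind that utf8_decode() replaces any character that has no Latin-1 equivalent with a question mark, and ’ (U+2019) is one of them, so this route only suits text that fits in Latin-1.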
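
The Latin-1 conversion step can be checked in isolation to see which characters survive it. This is a minimal sketch with a made-up sample string; it uses mb_convert_encoding() as a stand-in for utf8_decode(), which is deprecated as of PHP 8.2, and assumes the mbstring extension is available with its default substitute character.

```php
<?php
// Sample text: é exists in ISO-8859-1, the curly apostrophe (U+2019) does not.
$utf8 = "caf\u{E9} won\u{2019}t";

// Convert UTF-8 to ISO-8859-1; this plays the role utf8_decode() has in the answer.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

// Dump the bytes as hex: é is stored as the single byte e9,
// while the unrepresentable apostrophe is substituted with 3f ("?").
echo bin2hex($latin1), PHP_EOL;
```

If the hex dump shows 3f where a quotation mark used to be, the text does not fit in Latin-1 and the meta tag approach is the safer choice.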
