Tadej Magajna

Reputation: 2963

DOM xpath breaks quotation marks

I'm trying to parse some UTF-8 encoded HTML text that contains left and right quotation marks such as ’. But when I try to get the HTML back from the DOM with saveHTML(), the quotation marks always get garbled.

I've tried several things, including running utf8_encode() on the text before loading it into the DOM and passing ('1.0', 'UTF-8') to the DOMDocument constructor. Neither worked.

I'm running out of ideas for how to sort this out. Converting the quotation marks into HTML entities isn't an option for me.

Here is a simplified example that breaks the quotation marks:

$a = "<html><body><div>won’t you, will you, won’t you, join the </div></body></html>";
$dom = new DOMDocument();

$dom->loadHTML($a);

$xpath = new DOMXPath($dom);

$tag = $xpath->query('//div');

foreach($tag as $t)
    echo $dom->saveHTML($t);

The returned text looks like: will you, wonât you, will you, wonât you, join the

Upvotes: 1

Views: 195

Answers (2)

Dmitri Snytkine

Reputation: 1126

Ok, if you insist on using loadHTML then try this:

Add an appropriate meta tag to your HTML first, like this:

$a = "<html>
    <meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">
<body><div>won’t you, will you, won’t you, join the </div></body></html>";

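Then you can use loadHTML($a) and it will work. Without a charset declaration, loadHTML() does not treat the input as UTF-8, so the meta tag is what tells the parser how to decode it.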
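Here is a minimal sketch of the meta tag approach, assuming the DOM extension is enabled and the script file itself is saved as UTF-8. The names $html and $div are only illustrative. It checks the parsed text of the node as well as the serialized markup, because the exact output of saveHTML() can differ between libxml2 versions.

```php
<?php
// Sketch only: $html and $div are illustrative names, not from the question.
$html = "<html>
    <meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">
<body><div>won’t you, will you, won’t you, join the </div></body></html>";

$dom = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$div = $xpath->query('//div')->item(0);

// DOM strings are exposed as UTF-8, so this shows whether decoding went right.
echo $div->textContent, "\n";
// Serialize just the matched node, as in the question.
echo $dom->saveHTML($div), "\n";
```

If the quotation marks survive in textContent, the parser decoded the input correctly, and any remaining problem is on the output side.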

Lastly, if you just cannot add the extra meta tag, you can try $dom->loadHTML(utf8_decode($a)) instead. This first converts your string from UTF-8 to Latin-1, which is then loaded into the DOM in the Latin-1 charset, and you get the output as Latin-1 too. Be aware that this only helps for characters that exist in Latin-1: utf8_decode() replaces everything else, including ’ (U+2019), with a question mark.
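To see what the Latin-1 route does to individual characters, here is a small sketch with made-up sample strings. It relies on the documented behaviour that utf8_decode() substitutes '?' for any character with no ISO-8859-1 equivalent. The function is deprecated as of PHP 8.2, so the @ operator is used here only to silence that notice in the demo.

```php
<?php
// é (U+00E9) exists in Latin-1, so it survives as the single byte 0xE9.
$fits = @utf8_decode("caf\u{00E9}");
// ’ (U+2019) has no Latin-1 equivalent, so utf8_decode() substitutes '?'.
$lost = @utf8_decode("won\u{2019}t");

// A hex dump avoids any confusion caused by the terminal's own encoding.
echo bin2hex($fits), "\n";
echo $lost, "\n";
```

This should print 636166e9 followed by won?t, which suggests the conversion is lossy for the curly quotes in the question.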

Upvotes: 1

Dmitri Snytkine

Reputation: 1126

The solution seems to be to use $dom->loadXML($a) instead of loadHTML(). I tried it and it worked for me.
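A rough sketch of the loadXML() variant follows. It assumes the markup is well-formed XML, which typical tag-soup HTML is not, and it relies on XML defaulting to UTF-8 when no encoding is declared. The name $div is illustrative.

```php
<?php
$a = "<html><body><div>won’t you, will you, won’t you, join the </div></body></html>";

$dom = new DOMDocument();
// loadXML() returns false when the string is not well-formed XML.
if (!$dom->loadXML($a)) {
    throw new RuntimeException('Input is not well-formed XML');
}

$xpath = new DOMXPath($dom);
$div = $xpath->query('//div')->item(0);

// The parsed text, exposed as UTF-8 by the DOM API.
echo $div->textContent, "\n";
// Serialize only the matched element; how non-ASCII characters are written
// out (raw or as numeric references) may depend on the document's encoding.
echo $dom->saveXML($div), "\n";
```

The explicit check on the return value matters here, because a single unclosed tag is enough to make loadXML() fail.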

Upvotes: 1
