Reputation: 57184
In some of the XML files I'm parsing (often RSS) I run across text which contains characters like Today’s Newest
which is becoming Today’s Newest
after I extract the text from the node. This tells me I'm handling the decoding process incorrectly.
I could simply patch my script to fix this one bug, but what if there are many other characters that are becoming garbled? What is the proper way to digest XML files without trashing the encoding when converting it to a UTF-8 script?
Here are some of the things I've tried which don't seem to quite work:
$xml = file_get_contents($file);
// One: still contains ’
//$xml = @iconv('UTF-8', 'UTF-8//IGNORE', $xml);
// Two: LibXMLError Entity 'rsquo' not defined
//$xml = htmlentities($xml, null, 'UTF-8');
//$xml = htmlspecialchars_decode($xml, ENT_QUOTES);
// Three: still contains ’
//$xml = mb_convert_encoding($xml, "UTF-8", "UTF-8");
$xml = simplexml_load_string($xml, null, LIBXML_NOCDATA | LIBXML_NOENT);
Upvotes: 4
Views: 1965
Reputation: 2358
Check how you output your content. This could also happen if the output target does not support UTF-8.
I assume you are outputting to a browser, so check the browser's encoding and try explicitly setting it to UTF-8. You may be getting the correct text from the XML, and it is only being displayed wrong.
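For what it's worth, here is a minimal sketch of declaring UTF-8 on the output side, assuming the text ends up in an HTML page. The `$title` value is a hypothetical stand-in for whatever you pulled out of the node, not something from your feed.

```php
<?php
// Hypothetical sample: text as it might come back from the XML parser (UTF-8).
$title = "Today\u{2019}s Newest";

// Declare the response encoding; this must run before any output is sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

// Escape for HTML while treating the input as UTF-8, so multibyte characters survive.
$safe = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');

echo '<!DOCTYPE html><html><head><meta charset="UTF-8"></head><body>',
     $safe,
     '</body></html>';
```

If the page then shows the apostrophe correctly, the parsing was fine all along and only the declared output encoding was off.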
If that doesn't help, also try loading the XML with DOMDocument.
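A rough sketch of the DOMDocument route, assuming the feed declares its encoding in the XML prolog. The inline `$xml` string is a made-up sample, so swap in your own `file_get_contents($file)` result. DOM hands strings back as UTF-8 regardless of the source encoding.

```php
<?php
// Hypothetical feed snippet; the encoding declaration tells libxml how to decode the bytes.
$xml = '<?xml version="1.0" encoding="UTF-8"?>'
     . "<rss><channel><title>Today\u{2019}s Newest</title></channel></rss>";

// Collect parse errors instead of emitting warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
if (!$doc->loadXML($xml)) {
    foreach (libxml_get_errors() as $error) {
        error_log(trim($error->message));
    }
    exit(1);
}

// textContent is returned as a UTF-8 string.
$title = $doc->getElementsByTagName('title')->item(0)->textContent;
echo $title, PHP_EOL;
```

If `$title` still looks garbled here, the bytes in the file probably don't match the encoding the prolog claims, which is a problem with the feed rather than with the parser.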
Upvotes: 1
Reputation: 5685
Give this a try:
$xml = simplexml_load_string($xml, null, LIBXML_NOCDATA | LIBXML_NOENT);
$xml->addAttribute('encoding', 'UTF-8');
Upvotes: 1