How do you properly decode special characters in XML files?

Question

In some of the XML files I'm parsing (often RSS), I run across text such as "Today’s Newest" that turns into "Todayâ€™s Newest" after I extract it from the node. This tells me I'm handling the decoding incorrectly.

I could simply patch my script to fix this one bug, but what if many other characters are being garbled too? What is the proper way to read XML files into a UTF-8 script without trashing the encoding?

Here are some of the things I've tried that don't quite work:

$xml = file_get_contents($file);

// One: still contains â€™
//$xml = @iconv('UTF-8', 'UTF-8//IGNORE', $xml);

// Two: LibXMLError Entity 'rsquo' not defined
//$xml = htmlentities($xml, null, 'UTF-8');
//$xml = htmlspecialchars_decode($xml, ENT_QUOTES);

// Three: still contains â€™
//$xml = mb_convert_encoding($xml, "UTF-8", "UTF-8");

$xml = simplexml_load_string($xml, null, LIBXML_NOCDATA | LIBXML_NOENT);

zysoft · Accepted Answer

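Check how you output your content. This can also happen when the output target does not support UTF-8, or is not told that the text is UTF-8. The sequence â€™ is what the three UTF-8 bytes of ’ look like when they are displayed as Windows-1252.

I assume you are outputting to a browser, so check the browser's encoding and try explicitly setting it to UTF-8: you may be getting the correct text from the XML while it is merely displayed wrong.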
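A minimal sketch of that check, assuming the script renders an HTML page for a browser. The sample feed string and the variable names are made up for illustration; the point is to declare the charset on output and to confirm that the parsed string is already valid UTF-8 before blaming the parser.

```php
<?php
// Assumption: the output is HTML for a browser; adjust the MIME type otherwise.
// The header must be sent before any other output.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

// Hypothetical sample feed standing in for file_get_contents($file).
// &#8217; is the right single quotation mark (U+2019).
$xml = simplexml_load_string(
    '<?xml version="1.0" encoding="UTF-8"?><rss><title>Today&#8217;s Newest</title></rss>'
);
$title = (string) $xml->title;

// SimpleXML hands back UTF-8, so this should be true.
$isUtf8 = mb_check_encoding($title, 'UTF-8');

// Reading the same bytes as Windows-1252 reproduces the garbled text,
// which is what a viewer does when it assumes the wrong charset.
$misread = mb_convert_encoding($title, 'UTF-8', 'Windows-1252');

echo htmlspecialchars($title, ENT_QUOTES, 'UTF-8'), "\n";
```

If $isUtf8 is true and the page still shows the garbled form, the problem is most likely on the display side (missing charset header or meta tag, or a terminal or editor using another encoding), not in the XML handling.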

If that doesn't help, also try loading the XML with DOMDocument.
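A rough sketch of the DOMDocument route, under the assumption that the feed declares its encoding in the XML declaration. The ISO-8859-1 sample string is hypothetical and only there to show the conversion; with a real feed you would pass in the file contents instead.

```php
<?php
// Hypothetical sample: a Latin-1 encoded document, with é as the single byte 0xE9.
$xml = '<?xml version="1.0" encoding="ISO-8859-1"?>'
     . '<rss><title>Caf' . "\xE9" . '</title></rss>';

$doc = new DOMDocument();

// loadXML() uses the encoding named in the XML declaration to decode the input.
if (!$doc->loadXML($xml, LIBXML_NOCDATA)) {
    throw new RuntimeException('Could not parse XML');
}

// Strings read from the DOM are returned as UTF-8, whatever the source encoding was.
$title = $doc->getElementsByTagName('title')->item(0)->textContent;

echo $title, "\n";
```

Here the single byte 0xE9 comes back as the two-byte UTF-8 sequence c3 a9, so no manual iconv or mb_convert_encoding call is needed as long as the declaration in the feed is accurate.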
