Xeoncross

Reputation: 57184

How do you properly decode special characters in XML files?

In some of the XML files I'm parsing (often RSS), I run across text like Today’s Newest, which becomes Today’s Newest after I extract the text from the node. This tells me I'm handling the decoding incorrectly.

I could simply patch my script to fix this one bug, but what if there are many other characters that are becoming garbled? What is the proper way to digest XML files without trashing the encoding when converting it to a UTF-8 script?

Here are some of the things I've tried which don't seem to quite work:

$xml = file_get_contents($file);

// One: still contains ’
//$xml = @iconv('UTF-8', 'UTF-8//IGNORE', $xml);

// Two: LibXMLError Entity 'rsquo' not defined
//$xml = htmlentities($xml, null, 'UTF-8');
//$xml = htmlspecialchars_decode($xml, ENT_QUOTES);

// Three: still contains ’
//$xml = mb_convert_encoding($xml, "UTF-8", "UTF-8");

$xml = simplexml_load_string($xml, null, LIBXML_NOCDATA | LIBXML_NOENT);

Upvotes: 4

Views: 1965

Answers (2)

zysoft

Reputation: 2358

Check how you output your content. This could also happen if the output target does not support UTF-8.

I assume you are outputting to a browser, so check the browser's encoding and try explicitly setting it to UTF-8. You might be getting the correct text from the XML, with only the display being wrong.

Also, try loading the XML with DOMDocument if the above doesn't help.
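A rough sketch of both suggestions together, in case it helps. The sample feed string and the variable names ($doc, $title, $isUtf8) are made up for illustration, and it assumes the mbstring extension is available and PHP 7+ (for the "\u{2019}" escape). It declares the output as UTF-8, loads the XML with DOMDocument, and checks that the extracted text is valid UTF-8 before printing it:

```php
<?php
// Declare the output encoding first, so the browser does not guess
// (for example Windows-1252) and garble multi-byte characters.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// Hypothetical sample feed; a real script would read this from a file or URL.
// "\u{2019}" is the right single quotation mark, written as an escape so the
// example does not depend on how this source file itself is saved.
$xml = '<?xml version="1.0" encoding="UTF-8"?>'
     . "<rss><channel><title>Today\u{2019}s Newest</title></channel></rss>";

$doc = new DOMDocument();
$doc->loadXML($xml);

// The DOM extension returns node text as UTF-8.
$title = $doc->getElementsByTagName('title')->item(0)->textContent;

// If this is true, the parsing step is fine and any garbling happens on output.
$isUtf8 = mb_check_encoding($title, 'UTF-8');

echo htmlspecialchars($title, ENT_QUOTES, 'UTF-8');
```

If $isUtf8 is true but the page still shows garbage, the problem is most likely in how the output is declared or displayed, not in the parsing.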

Upvotes: 1

Kalpesh

Reputation: 5685

Give this a try:

$xml = simplexml_load_string($xml, null, LIBXML_NOCDATA | LIBXML_NOENT);
$xml->addAttribute('encoding', 'UTF-8');

Upvotes: 1
