Reputation: 2100
I'm starting out with some XML that looks like this (simplified):
<?xml version="1.0" encoding="UTF-8"?>
<alldata>
<data name="Forsetì" />
</alldata>
</xml>
But after I've parsed it with simplexml_load_string
the special character (the i) becomes: ì
which is obviously pretty mangled.
Is there a way to prevent this from happening?
I know for a fact the XML is fine, when saved as .txt and viewed in the browser the characters are fine. When I use simplexml_load_string on the XML and then save values as a text file, or to the database, its mangled.
Upvotes: 4
Views: 17838
Reputation: 5752
Late to the party... But I've faced this and solved like below.
You have declared encoding in XML so if you load xml file using DOMDocument it won't cause any issue.
But in case it happens in other use case, you can use html_entity_decode
like below:
html_entity_decode($xml->saveXML());
Upvotes: 0
Reputation: 17555
XML is strict when it comes to entities, like & should be &amp;
and ì should &igrave;
So you will need a translation table.
function xml_entity_decode($_string) {
// Set up XML translation table
$_xml=array();
$_xl8=get_html_translation_table(HTML_ENTITIES,ENT_COMPAT);
while (list($_key,)=each($_xl8))
$_xml['&#'.ord($_key).';']=$_key;
return strtr($_string,$_xml);
}
Upvotes: 0
Reputation: 25956
This looks SimpleXML is creating a UTF-8 string, which is then rendered in ISO-8859-1 (latin-1) or something close like CP-1252.
When you save the result to a file and serve that file via a web server, the browser will use the encoding declared in the file.
Including in a web page
Since your web page encoding is not UTF-8, you need to convert the string to whatever encoding you are using, eg ISO-8859-1 (latin-1).
This is easily done with iconv():
$xmlout = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $xmlout);
Saving to database
You database column is not using UTF-8 collation, so you should use iconv
to convert the string to the charset that your database uses.
Assuming your database collation is the same as the encoding that you render in, you will not have to do anything when reading from the database.
Explanation
In UTF-8, a 0xc2 prefix byte is used to access the top half of the "Latin-1 Supplement" block which includes characters such as accented letters, currency symbols, fractions, superscript 2 and 3, the copyright and registered trademark symbols, and the non-breaking space.
However in ISO-8859-1, the byte 0xC2 represents an Â. So when your UTF-8 string is misinterpreted as one of those, then you get  followed by some other nonsense character.
Upvotes: 6
Reputation: 1879
I've also had some problems with this, and it came from the PHP script encoding. Make sure it's set to UTF-8. If it's still not good, try printing the variable using uft8_encode or utf8_decode.
Upvotes: 0
Reputation: 449783
It's very likely that the XML is fine, but the character gets mangled when stored or output.
If you're outputting data on a HTML page: Make sure it's encoded in UTF-8 as well. If your HTML page is in ISO-8859-1, you can use utf8_decode
as a quick fix; using UTF-8 is the better option in the long run.
If you're storing the data in a mySQL, you need to have UTF8 selected as the encoding all the way through: As the connection's encoding, in the table, and in the column(s) you insert the data into.
Upvotes: 0