Reputation: 55604
I have an XML file which I'm parsing with SimpleXML in PHP. The first line is
<?xml version="1.0" encoding="iso-8859-1"?>
The result of the parse is stored in $xml. If I do:
echo $xml->asXML();
then the entire file displays perfectly.
But if I dig into the structure in any way, I get Â's everywhere, e.g.:
echo $xml->Chapter->asXML();
Inside some of the XML elements there is MathML (<math>); this is where the Â's occur.
For example, the character ∈ is replaced by a Â.
How can I parse the XML file but not lose the MathML characters?
Upvotes: 2
Views: 784
Reputation: 1281
The problem is not your encoding. The problem is that not all browsers support the MathML that your script is echoing to the browser.
http://en.wikipedia.org/wiki/MathML#Web_browsers
Tested this in the following browser:
Upvotes: 0
Reputation: 51970
∈ is not a character that can be represented in ISO 8859-1. Change your XML declaration to say that the document is encoded as UTF-8.
Here is an example demonstrating the problem:
$x = simplexml_load_string('<?xml version="1.0" encoding="iso-8859-1"?>
<example><math>∈</math></example>');
echo $x->math, PHP_EOL;
$x = simplexml_load_string('<?xml version="1.0" encoding="utf-8"?>
<example><math>∈</math></example>');
echo $x->math, PHP_EOL;
Outputs (as UTF-8) the following.
â
∈
SimpleXML will try to convert the input to UTF-8 when the declared encoding is something different. If the input is already UTF-8 encoded and the encoding declaration is incorrect, that conversion garbles the text, so it is better to correct the declaration than to leave SimpleXML that work to do.
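If you cannot edit the file itself, one option is to correct the declaration in memory before parsing. The following is only a sketch: loadWithUtf8Declaration() is a hypothetical helper name, and it assumes the bytes really are UTF-8 and only the declaration is wrong. It does not convert anything.

```php
<?php
// Hypothetical helper: rewrite the encoding pseudo-attribute of the XML
// declaration to "utf-8" before handing the string to SimpleXML.
// Assumption: the content is already UTF-8; only the declaration is wrong.
function loadWithUtf8Declaration(string $source)
{
    $fixed = preg_replace(
        '/^(<\?xml[^>]*?encoding=)(["\'])[^"\']*\2/i',
        '${1}${2}utf-8${2}',
        $source,
        1 // only touch the first match, i.e. the declaration itself
    );

    return simplexml_load_string($fixed);
}

$x = loadWithUtf8Declaration('<?xml version="1.0" encoding="iso-8859-1"?>
<example><math>∈</math></example>');
echo $x->math, PHP_EOL;
```

Fixing the source file is still the cleaner solution; this only avoids the mis-conversion at load time.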
Also be sure that PHP itself is outputting UTF-8, and telling the browser that this is the case!
You can do this by setting the default_charset INI option (in your php.ini or with ini_set()), or by sending the correct Content-Type header (header('Content-Type: text/html; charset=utf-8')).
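As a rough sketch of both options together (either one alone should be enough; the XML string here is a made-up stand-in for the real file):

```php
<?php
// Option 1: make UTF-8 the charset PHP advertises in its default Content-Type header.
ini_set('default_charset', 'UTF-8');

// Option 2: send the header explicitly. This must run before any output is echoed.
header('Content-Type: text/html; charset=utf-8');

// Stand-in document, declared (and actually encoded) as UTF-8.
$xml = simplexml_load_string('<?xml version="1.0" encoding="utf-8"?>
<example><math>∈</math></example>');

// The browser is now told to read the echoed bytes as UTF-8.
echo $xml->math->asXML();
```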
Upvotes: 2
Reputation: 1511
You may need to convert the input to another encoding before parsing it with SimpleXML.
The iconv() function is very useful for this: http://php.net/manual/en/function.iconv.php
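A minimal sketch of that approach, assuming you know the real source encoding (here ISO-8859-1, with a made-up input string). It will not help if the bytes are already UTF-8 and only the declaration is wrong.

```php
<?php
// Made-up input: ISO-8859-1 bytes (\xE9 is é) with no encoding declaration.
$latin1 = "<example><word>caf\xE9</word></example>";

// iconv(from, to, string) returns the converted string, or false on failure.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);
if ($utf8 === false) {
    throw new RuntimeException('Conversion to UTF-8 failed');
}

// With no declaration present, the parser treats the input as UTF-8.
$xml = simplexml_load_string($utf8);
echo $xml->word, PHP_EOL;
```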
Upvotes: -1