Reputation: 484
I work for an international company, so we have a lot of languages to cater for. I'm having a problem with some special characters.
I created a standalone test PHP page to eliminate any other issues that could be introduced by my system.
From various pages I read through, I found that SimpleXML processes XML as UTF-8, e.g. "PHP SimpleXML Values returned have weird characters in place of hyphens and apostrophes".
So I did just that at the top of the page:
header("Content-type:text/html; charset=UTF-8");
Then I did this to check:
print mb_internal_encoding();
Not sure if this is the right function, but it gave me ISO-8859-1 in FF and Chrome.
XML looks like this:
$xml = '<?xml version="1.0" encoding="ISO-8859-15"?>
<Tracking>
<File>
<FileNumber>çúé$`~ € Š š Ž ž Œ œ Ÿ</FileNumber>
<OrigBranch>124</OrigBranch>
<Login></Login>
</File>
</Tracking>';
This prints out strangely, but for the page I need, I'm not too concerned about how it prints in the browser: the actual page will run from a cron job to import the XML into a MySQL DB, so display is not too important. It displays in FF like this, though:
print $xml;
���$`~ � � � � � � � � � 124
Then i create the SimpleXML object :
$parser = new SimpleXMLElement($xml);
print_r($parser);
This prints out :
[File] => SimpleXMLElement Object
(
[FileNumber] => çúé$`~
[OrigBranch] => 124
[Login] => SimpleXMLElement Object
(
)
)
I'm not too worried about the funny characters in the print $xml; output. What I need to fix are the characters in the SimpleXMLElement object that is being inserted into the DB. Why is the SimpleXMLElement object losing the characters after the '~'? I tried changing the charset to ISO-8859-15 in the header function call, but this only led to the print $xml; output looking slightly better, though still missing the characters after '~', while SimpleXMLElement gave a fatal error:
'String could not be parsed as XML'
Before parsing the XML, I tried:
$xml = mb_convert_encoding($xml, "ISO-8859-15");
$xml = iconv('UTF-8', 'ISO-8859-15//TRANSLIT', $xml);
But these did not help either. Any suggestions?
Upvotes: 0
Views: 5396
Reputation: 51
I created a specific file in latin1 (ISO-8859-1) named latin1.xml with this content (you can add encoding="UTF-8" in the xml tag, it's the same):
<?xml version="1.0"?>
<Tracking>
<File>
<FileNumber>çùé$ °à §çòò àù§</FileNumber>
<OrigBranch>124</OrigBranch>
<Login></Login>
</File>
</Tracking>
Then I loaded the content in the PHP file and converted it from ISO-8859-1 to UTF-8, and after that parsed it with SimpleXMLElement.
I echoed the content of the XML before and after the conversion:
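<?php
// Read the raw ISO-8859-1 bytes from disk.
$xml = file_get_contents('latin1.xml');
echo '<pre>'.$xml.'</pre>'."<br>";
// Convert to UTF-8, the encoding the parser assumes when none is declared.
$xml2 = iconv("ISO-8859-1","UTF-8",$xml);
echo '<pre>'.$xml2.'</pre>'."<br>";
$parser = new SimpleXMLElement($xml2);
// Pass true so print_r() returns its output instead of printing it immediately.
echo '<pre>'.print_r($parser, true).'</pre>'."<br>";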
Now, when you load the script with the browser set to UTF-8 encoding, the first echo will (as expected) display badly, but the second echo and the print_r($parser) output will be fine. Conversely, if the browser is set to ISO-8859-1, the first echo will look right, but the second echo and the print_r output will not.
You can adjust to fit your needs.
UPDATE
ISO/IEC 8859-1 is missing some characters for French and Finnish text, as well as the euro sign.
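To make the difference concrete, here is a minimal sketch that decodes the same byte under both charsets. It assumes the iconv extension is available and supports both charset names; the variable names are only illustrative.

```php
<?php
// Byte 0xA4 is one of the positions where the two charsets differ.
$byte = "\xA4";

// Read as ISO-8859-1, it is the generic currency sign (U+00A4).
$asLatin1 = iconv('ISO-8859-1', 'UTF-8', $byte);

// Read as ISO-8859-15, it is the euro sign (U+20AC).
$asLatin9 = iconv('ISO-8859-15', 'UTF-8', $byte);

// Print the resulting UTF-8 bytes as hex, so the output does not
// depend on the encoding of the terminal or browser.
echo bin2hex($asLatin1), "\n"; // c2a4
echo bin2hex($asLatin9), "\n"; // e282ac
```

So picking the wrong source charset in iconv does not raise an error here, it silently yields a different character.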
If I understand your comments correctly, you can have the source file (XML) in ISO-8859-15; that way the euro sign is handled correctly.
I made a new file, named iso8859-15.xml, and put your new test characters there (including the euro sign). In the PHP file I changed the first instruction:
//$xml = file_get_contents('latin1.xml');
$xml = file_get_contents('iso8859-15.xml');
and, later, changed the conversion to:
$xml2 = iconv("ISO-8859-15","UTF-8",$xml);
Now, when you load the script with the browser set to UTF-8 encoding, the first echo will (as expected) display badly, but the second echo and the print_r($parser), the output of SimpleXML, will be fine.
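If you want to check the conversion without relying on how the browser renders it, something like the following sketch may help. It uses an in-memory string as a hypothetical stand-in for the contents of iso8859-15.xml, and, like the test file above, it has no encoding attribute in the XML declaration (if the declaration still said ISO-8859-15 after converting to UTF-8, the parser would be expected to follow the declaration and misread the bytes).

```php
<?php
// Hypothetical stand-in for file_get_contents('iso8859-15.xml').
// \xE7\xFA\xE9 is "çúé" and \xA4 is the euro sign in ISO-8859-15.
$xml = "<?xml version=\"1.0\"?>\n"
     . "<Tracking><File><FileNumber>\xE7\xFA\xE9 \xA4</FileNumber></File></Tracking>";

// Convert the raw bytes to UTF-8 before handing them to the parser.
$xml2 = iconv('ISO-8859-15', 'UTF-8', $xml);
$parser = new SimpleXMLElement($xml2);

// Cast to string to get the element text (UTF-8) out of SimpleXML.
$fileNumber = (string) $parser->File->FileNumber;

// Hex dump of the UTF-8 bytes: c3a7 c3ba c3a9 20 e282ac
echo bin2hex($fileNumber), "\n"; // c3a7c3bac3a920e282ac
```

Comparing hex dumps like this sidesteps the question of which charset the page itself is served in.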
So, now that your XML is parsed correctly (in UTF-8), you can convert the values before writing them to the DB (which is in ISO-8859-15 encoding, if I understood correctly).
To make this clearer, you can add this line at the end of the PHP script above:
echo '<pre> File number in ISO-8859-15 for db: '.iconv("UTF-8","ISO-8859-15",$parser->File->FileNumber).'</pre>'."<br>";
As you can see, I converted the UTF-8 data from SimpleXML to ISO-8859-15, as you should do when you write to the DB.
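For a cron import it may be convenient to wrap that conversion in a small function. The sketch below is only one way to do it: toLatin9() is a hypothetical helper name, and the failure handling assumes iconv() returns false when a character has no ISO-8859-15 equivalent, which can vary between iconv implementations.

```php
<?php
// Hypothetical helper: convert a UTF-8 value from SimpleXML to ISO-8859-15.
function toLatin9(string $utf8): string
{
    // @ hides the notice iconv() may raise; the return value is checked instead.
    $converted = @iconv('UTF-8', 'ISO-8859-15', $utf8);
    if ($converted === false) {
        throw new RuntimeException('Value cannot be represented in ISO-8859-15');
    }
    return $converted;
}

// Sample document; \xE2\x82\xAC is the euro sign in UTF-8.
$parser = new SimpleXMLElement(
    "<Tracking><File><FileNumber>\xE2\x82\xAC 124</FileNumber></File></Tracking>"
);

// Cast the element to string first, then convert for the DB.
$forDb = toLatin9((string) $parser->File->FileNumber);

echo bin2hex($forDb), "\n"; // a420313234
```

Failing loudly is a design choice here; appending //TRANSLIT to the target charset is an alternative if approximations are acceptable.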
That worked for me.
Hope it helps
Upvotes: 2
Reputation: 139
If you build the XML yourself, try to base64-encode all strings, and then decode them on the client side where you read the XML.
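A rough sketch of that idea, assuming you control both the code that writes the XML and the code that reads it (the element names are borrowed from the question, and the sample value is made up):

```php
<?php
// Raw ISO-8859-15 bytes for "çúé €" (assumed sample value).
$original = "\xE7\xFA\xE9 \xA4";

// Producer side: store only the base64 text, which is plain ASCII.
$doc = new SimpleXMLElement('<Tracking><File/></Tracking>');
$doc->File->addChild('FileNumber', base64_encode($original));
$xml = $doc->asXML();

// Consumer side: parse, then decode back to the original bytes.
$parsed = new SimpleXMLElement($xml);
$decoded = base64_decode((string) $parsed->File->FileNumber, true);

// The round trip gives back exactly the bytes that went in.
var_dump($decoded === $original);
```

Note that the decoded value is still in whatever charset the producer used, so this only keeps the parser from touching the bytes; it does not convert anything.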
Upvotes: 0