Shaakir

Reputation: 484

SimpleXML and french characters

I work for an international company, so we have loads of languages to cater for. I'm having a problem with some special characters.

I created a standalone test PHP page to eliminate any other issues that could be introduced by my system.

From various pages I read through, I found that SimpleXML processes XML as UTF-8, e.g. "PHP SimpleXML Values returned have weird characters in place of hyphens and apostrophes".

So I did just that at the top of the page:

header("Content-type:text/html; charset=UTF-8");

Then I did this to check:

print mb_internal_encoding();

Not sure if this is the right function, but it gave me ISO-8859-1 in FF and Chrome.

XML looks like this:

$xml = '<?xml version="1.0" encoding="ISO-8859-15"?>
<Tracking>
<File>
<FileNumber>çúé$`~  €   Š   š   Ž   ž   Œ   œ   Ÿ</FileNumber>
<OrigBranch>124</OrigBranch>
<Login></Login>
</File>
</Tracking>';

This prints out all funny, but for the page I need, I'm not too concerned about how it prints in the browser: the actual page will run from a cron job to import the XML into a MySQL DB, so display is not too important. It displays in FF like this, though:

print $xml;
���$`~ � � � � � � � � � 124

Then I create the SimpleXML object:

$parser = new SimpleXMLElement($xml);
print_r($parser);

This prints out :

[File] => SimpleXMLElement Object
    (
        [FileNumber] => çúé$`~                           
        [OrigBranch] => 124
        [Login] => SimpleXMLElement Object
            (
            )

    )

I'm not too worried about the funny characters in the print $xml; output; what I need to fix is the characters in the SimpleXMLElement object that is being inserted into the DB. Why is the SimpleXMLElement object losing the characters after the '~'? I tried changing the charset to ISO-8859-15 in the header() call, but this only made the print $xml; output look slightly better (still missing the characters after '~'), and SimpleXMLElement then gave a fatal error:

String could not be parsed as XML

Before parsing the XML, I also tried:

$xml = mb_convert_encoding($xml, "ISO-8859-15");
$xml = iconv('UTF-8', 'ISO-8859-15//TRANSLIT', $xml);

But these did not help either. Any suggestions?

Upvotes: 0

Views: 5396

Answers (3)

Stramaz

Reputation: 51

I created a file named latin1.xml, saved as Latin-1 (ISO-8859-1), with this content (you can add encoding="UTF-8" to the XML declaration; the result is the same, because the content is converted to UTF-8 before it is parsed):

<?xml version="1.0"?>
<Tracking>
<File>
<FileNumber>çùé$ °à §çòò àù§</FileNumber>
<OrigBranch>124</OrigBranch>
<Login></Login>
</File>
</Tracking>

Then I loaded the content in a PHP file, converted it from ISO-8859-1 to UTF-8, and parsed the result with SimpleXMLElement. I echoed the content of the XML before and after the conversion:

<?php
$xml = file_get_contents('latin1.xml');
echo '<pre>'.$xml.'</pre>'."<br>";
$xml2 = iconv("ISO-8859-1","UTF-8",$xml);
echo '<pre>'.$xml2.'</pre>'."<br>";
$parser = new SimpleXMLElement($xml2);
// Pass true so print_r() returns the dump instead of printing it immediately.
echo '<pre>'.print_r($parser, true).'</pre>'."<br>";

Now, when you load the script with the browser set to UTF-8 encoding, the first echo will (as expected) be displayed incorrectly, but the second echo and the print_r($parser) output will be fine. If the browser is set to ISO-8859-1 instead, the first echo will look right, but the second echo and the print_r output will not.

You can adjust to fit your needs.

UPDATE

ISO/IEC 8859-1 is missing some characters for French and Finnish text, as well as the euro sign. If I understand your comments correctly, your source file (XML) can be in ISO-8859-15, which lets you use the euro sign correctly. I made a new file, named iso8859-15.xml, and put your new test characters in it (including the euro sign). In the PHP file I changed the first statement:

//$xml = file_get_contents('latin1.xml');
$xml = file_get_contents('iso8859-15.xml');

and, later, the conversion to:

$xml2 = iconv("ISO-8859-15","UTF-8",$xml);

Now, when you load the script with the browser set to UTF-8 encoding, the first echo will (as expected) be displayed incorrectly, but the second echo and the print_r($parser) output, i.e. the SimpleXML result, will be fine.
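One caveat if the XML carries its own encoding declaration, as the ISO-8859-15 string in the question does: after iconv() the bytes are UTF-8, so the declaration should say so too. Below is a minimal sketch of that step. The sample string and the preg_replace() pattern are my own assumptions for illustration, not part of the original files, and the bytes are written as hex escapes so the result does not depend on how the PHP file itself is saved.

```php
<?php
// Hypothetical sample in ISO-8859-15: \xE7 = ç, \xE9 = é, \xA4 = €.
$raw = "<?xml version=\"1.0\" encoding=\"ISO-8859-15\"?>\n"
     . "<Tracking><File><FileNumber>\xE7\xE9 \xA4</FileNumber></File></Tracking>";

// Convert the bytes to UTF-8 ...
$utf8 = iconv('ISO-8859-15', 'UTF-8', $raw);

// ... and make the declaration agree with the converted bytes.
$utf8 = preg_replace('/encoding="ISO-8859-15"/i', 'encoding="UTF-8"', $utf8, 1);

$parser = new SimpleXMLElement($utf8);

// SimpleXML hands back UTF-8 strings.
$fileNumber = (string) $parser->File->FileNumber;
```

If the declaration were left at ISO-8859-15, the parser would treat the already converted bytes as ISO-8859-15 again and the text would come out double-encoded.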

So, now that your XML is parsed correctly (in UTF-8), you can convert the values before writing them to the DB (which uses ISO-8859-15 encoding, if I understood correctly). To make this clearer, you can add this line at the end of the PHP script above:

echo '<pre> File number in ISO-8859-15 for db: '.iconv("UTF-8","ISO-8859-15",$parser->File->FileNumber).'</pre>'."<br>";

As you can see, I converted the UTF-8 data from SimpleXML to ISO-8859-15, which is what you should do when you write to the DB. That worked for me.
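To check that conversion without relying on what the browser shows, you can look at the bytes themselves. This is only a sketch with a made-up value (ç, a space and the euro sign); the variable names are hypothetical.

```php
<?php
// Hypothetical UTF-8 input: \xC3\xA7 = ç, \xE2\x82\xAC = €.
$utf8Xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
         . "<Tracking><File><FileNumber>\xC3\xA7 \xE2\x82\xAC</FileNumber></File></Tracking>";

$parser = new SimpleXMLElement($utf8Xml);

// Cast to string explicitly: the property is a SimpleXMLElement object.
$value = (string) $parser->File->FileNumber;

// Convert to the encoding assumed for the DB column.
$forDb = iconv('UTF-8', 'ISO-8859-15', $value);

// Inspect the raw bytes: e7 = ç, 20 = space, a4 = € in ISO-8859-15.
$hex = bin2hex($forDb);
```

Note that iconv() fails on a character that has no ISO-8859-15 equivalent unless you append //TRANSLIT or //IGNORE to the target charset.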

Hope it helps

Upvotes: 2

Mina

Reputation: 1516

Try $xml = '<?xml version="1.0" encoding="UTF-8"?>...
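Changing the declaration should only help if the bytes of the string really are UTF-8; the declaration has to match the actual encoding. A rough sketch of the difference follows. The tryParse() helper and the sample bytes are assumptions for illustration only.

```php
<?php
// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical helper: returns the FileNumber text, or null if parsing fails.
function tryParse(string $xml): ?string
{
    try {
        $el = new SimpleXMLElement($xml);
        return (string) $el->FileNumber;
    } catch (Exception $e) {
        libxml_clear_errors();
        return null;
    }
}

// Latin-1 bytes: \xE7 = ç, \xE9 = é. This sequence is not valid UTF-8.
$latin1Body = "<File><FileNumber>\xE7\xE9</FileNumber></File>";

// Declaration claims UTF-8 but the bytes are Latin-1: parsing is rejected.
$mismatch = tryParse('<?xml version="1.0" encoding="UTF-8"?>' . $latin1Body);

// Declaration matches the bytes: parsing works and the value comes back as UTF-8.
$matching = tryParse('<?xml version="1.0" encoding="ISO-8859-1"?>' . $latin1Body);
```

So either declare the encoding the data is really in, or convert the data first and then declare UTF-8.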

Upvotes: -1

btlr.com

Reputation: 139

If you build the XML yourself, try base64-encoding all strings when you write them, and then decode them again on the side that reads the XML.
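A minimal sketch of that round trip, assuming you control both the code that builds the XML and the code that reads it (the element names and the sample value are made up):

```php
<?php
// Hypothetical value to transport: UTF-8 bytes for "çé €".
$original = "\xC3\xA7\xC3\xA9 \xE2\x82\xAC";

// Producer side: base64 output is plain ASCII, so it survives any XML encoding.
$xml = '<?xml version="1.0" encoding="UTF-8"?>'
     . '<File><FileNumber>' . base64_encode($original) . '</FileNumber></File>';

// Consumer side: parse, then decode back to the original bytes.
$parser = new SimpleXMLElement($xml);
$decoded = base64_decode((string) $parser->FileNumber, true);
```

This only moves the bytes around safely. You still need to know which encoding the original bytes are in when you store them in the DB.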

Upvotes: 0
