Reputation: 1168
I am trying to write a function that reads an existing XML file and creates a new one with all the data from the first one, but in a different encoding. As far as I understand it, SimpleXML saves the file in UTF-8 encoding. My original XML file is Windows-1257.
Code:
public static function toUTF8()
{
    $remote_file = "data/test/import/test.xml";
    $xml = simplexml_load_file($remote_file);
    $xml->asXml('data/test/import/utf8/test.xml');
    // var_dump() prints on its own and returns nothing, so no echo is needed
    var_dump('done');
    exit;
}
This way, the encoding of the output file is still wrong. I wanted to try this:
$newXML = new SimpleXMLElement($xml);
But this takes only plain XML code as a parameter. How could I get the whole XML code from the object? Or how else could I create a new UTF-8 XML object and insert all the data from the old file?
Upvotes: 2
Views: 7432
Reputation: 42712
I tried this out and saw problems importing the XML directly with SimpleXML: despite the correct encoding declaration in the XML, it would output the wrong characters. The alternative is to use a function like iconv, which can do the conversion for you.
If you don't need to parse the XML, you can convert the bytes directly (note that this leaves the encoding named in the XML declaration unchanged):
<?php
$remote_file = "data/test/import/test.xml";
$new_file = "data/test/import/utf8/test.xml";
$baltic_xml = file_get_contents($remote_file);
$unicode_xml = iconv("CP1257", "UTF-8", $baltic_xml);
file_put_contents($new_file, $unicode_xml);
If you need to work with the XML, it gets a little more complicated, because you also have to update the encoding in the XML declaration before parsing:
<?php
$remote_file = "data/test/import/test.xml";
$new_file = "data/test/import/utf8/test.xml";
$baltic_xml = file_get_contents($remote_file);
$unicode_xml = iconv("CP1257", "UTF-8", $baltic_xml);
// The declaration has to name the new encoding, or the parser will misread the UTF-8 bytes
$unicode_xml = str_replace('encoding="CP1257"', 'encoding="UTF-8"', $unicode_xml);
$xml = new SimpleXMLElement($unicode_xml);
// do stuff with $xml
$xml->asXml($new_file);
I tested this out with the following file (saved as CP1257) and it worked fine:
<?xml version="1.0" encoding="CP1257"?>
<Root-Element>
<Test>Łų߯ĒČ</Test>
</Root-Element>
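One caveat with the str_replace() call: it only matches a declaration that says exactly encoding="CP1257", while a real file may well declare windows-1257, use single quotes, or differ in case. A rough sketch of a more tolerant version follows. The helper name xmlToUtf8 is made up for illustration, and the regular expression is my own assumption about what declarations look like, so treat it as a starting point rather than a tested solution.

```php
<?php
// Hypothetical helper (not part of the answer above): converts an XML string
// to UTF-8 and rewrites whatever encoding the XML declaration names.
function xmlToUtf8(string $xml, string $fromEncoding): string
{
    $converted = iconv($fromEncoding, 'UTF-8', $xml);
    if ($converted === false) {
        throw new RuntimeException("Could not convert from $fromEncoding");
    }

    // Rewrite only the encoding pseudo-attribute of a leading <?xml ... declaration.
    // Accepts single or double quotes and any declared encoding name.
    return preg_replace(
        '/^(<\?xml[^>]*?\bencoding\s*=\s*)(["\'])[^"\']*\2/i',
        '$1$2UTF-8$2',
        $converted,
        1
    );
}
```

It could be called as xmlToUtf8(file_get_contents($remote_file), 'CP1257') and the result passed to new SimpleXMLElement() or file_put_contents(). If the file has no XML declaration at all, the string is converted and returned without one, which is fine because UTF-8 is the default.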
Upvotes: 1
Reputation: 146450
Unless I'm wrong, the SimpleXML extension will just use the same encoding all the way through. UTF-8 is the default if no encoding is given but, if the original document has encoding information, that encoding will be used.
You can use DOMDocument as a proxy:
$xml = simplexml_load_file(__DIR__ . '/test.xml');
$doc = dom_import_simplexml($xml)->ownerDocument;
$doc->encoding = 'UTF-8';
$xml->asXml('as-utf-8.xml');
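The same approach also gives you the whole XML as a string, which the question asked about: asXML() returns the serialized document when called without a filename. Here is a small self-contained sketch. The sample document is an assumption of mine, built in memory as ISO-8859-1 rather than Windows-1257 so that it does not depend on which converters your libxml build includes.

```php
<?php
// Assumed sample input: a tiny ISO-8859-1 document (0xE9 is "é" in that encoding)
$source = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n<root><t>\xE9</t></root>";

$xml = simplexml_load_string($source);

// The DOM element wraps the same underlying document as the SimpleXML object,
// so changing the encoding here affects what SimpleXML writes out
$doc = dom_import_simplexml($xml)->ownerDocument;
$doc->encoding = 'UTF-8';

// Without a filename, asXML() returns the XML as a string instead of saving it
$utf8 = $xml->asXML();
```

From there, $utf8 can be written with file_put_contents() or passed to new SimpleXMLElement() if a fresh object is needed.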
Upvotes: 1