user11398537

Is it possible to convert an XML file to UTF-8 using PHP?

I have an XML document that is encoded as UTF-16 LE. Because of that, it cannot be read by WP All Import.

When I look at the XML declaration, I see this:

<?xml version="1.0" encoding="Unicode" ?>

And at the bottom of Visual Studio Code I see UTF-16 LE.

I already converted it using Visual Studio Code, but since there will be a new file every time (in the same format), it would be great if PHP could convert it to UTF-8.

<?xml version="1.0" encoding="Unicode" ?>
<root>
  <docs>

Is it possible to change the encoding of this file using PHP?

Upvotes: 0

Views: 843

Answers (2)

ThW

Reputation: 19512

DOMDocument::loadXML() reads the encoding attribute from the XML declaration. However, Unicode is not a valid encoding name afaik; I would expect UTF-16LE. The DOM API in PHP uses UTF-8 internally, so it decodes the source to UTF-8 (according to the declared encoding) when loading and encodes the output according to the encoding of the target document when saving. You can simply change that encoding after loading.

Here is a demo:

$xml = <<<'XML'
<?xml version="1.0" encoding="utf-8"?>
<foo>ÄÖÜ</foo>
XML;

$document = new DOMDocument();
$document->loadXML($xml);

$encodings = ['ASCII', 'UTF-16', 'UTF-16LE', 'UTF-16BE'];

foreach ($encodings as $encoding) {
    // set required encoding
    $document->encoding = $encoding;
    // save
    echo $encoding."\n".$document->saveXML()."\n";
}

Output:

ASCII
<?xml version="1.0" encoding="ASCII"?>
<foo>&#196;&#214;&#220;</foo>

UTF-16
��<?xml version="1.0" encoding="UTF-16"?>
<foo>���</foo>

UTF-16LE
<?xml version="1.0" encoding="UTF-16LE"?>
<foo>���</foo>

UTF-16BE
<?xml version="1.0" encoding="UTF-16BE"?>
<foo>���</foo>

The generated string changes with the defined encoding.

I started with a UTF-8 document here because SO itself is UTF-8, so you can see the non-ASCII characters that way. ASCII triggers entity encoding for non-ASCII characters. UTF-16 adds a BOM to indicate the byte order. SO cannot display the UTF-16 encoded characters, so you get the � symbol. UTF-16LE and UTF-16BE define the byte order in the encoding name, so no BOM is needed.

Of course, it works the same way in the other direction.
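For the file in the question, which declares encoding="Unicode", a more defensive route is to convert the raw bytes before any XML parser sees them. The following is only a minimal sketch and a different technique from the DOM approach above: it assumes the mbstring extension is available and that the input really is UTF-16 LE. The function name xmlUtf16LeToUtf8 and the sample document are made up for illustration.

```php
<?php
// Hypothetical helper (not part of PHP or any library): converts an XML
// byte string from UTF-16LE to UTF-8 and fixes the XML declaration.
function xmlUtf16LeToUtf8(string $bytes): string
{
    // Decode the UTF-16LE bytes and re-encode them as UTF-8 (needs ext-mbstring).
    $utf8 = mb_convert_encoding($bytes, 'UTF-8', 'UTF-16LE');

    // Drop a leading byte order mark (EF BB BF in UTF-8) if one survived.
    if (strncmp($utf8, "\xEF\xBB\xBF", 3) === 0) {
        $utf8 = substr($utf8, 3);
    }

    // Replace whatever encoding the declaration names (e.g. "Unicode") with UTF-8.
    $fixed = preg_replace(
        '/^(<\?xml[^>]*?encoding=)(["\'])[^"\']*\2/',
        '$1$2UTF-8$2',
        $utf8,
        1
    );

    // preg_replace() returns null on error; fall back to the converted text.
    return $fixed ?? $utf8;
}

// In-memory demo: build a UTF-16LE sample with a BOM, then convert it back.
$sample = '<?xml version="1.0" encoding="Unicode" ?><root><docs>' . "\xC3\x84" . '</docs></root>';
$utf16  = "\xFF\xFE" . mb_convert_encoding($sample, 'UTF-16LE', 'UTF-8');

echo xmlUtf16LeToUtf8($utf16), "\n";
```

With real files this would be something like file_put_contents('out.xml', xmlUtf16LeToUtf8(file_get_contents('in.xml'))), where both file names are placeholders.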

Upvotes: 1

Yitzhak Khabinsky

Reputation: 22301

Here is a generic XSLT stylesheet (an identity transform) that copies your entire input XML as-is, but with the encoding specified in xsl:output. All that is left is to run the XSLT transformation in PHP.

XSLT

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="yes" encoding="utf-8"/>

    <xsl:template match="node()|@*">
        <xsl:copy>
            <xsl:apply-templates select="node()|@*"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
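Running that stylesheet from PHP could look roughly like this. It is a sketch that assumes the xsl extension (XSLTProcessor) is installed. The sample input is a small in-memory document rather than the file from the question, and whether the parser accepts a declaration with encoding="Unicode" is not something this sketch checks.

```php
<?php
// The identity stylesheet from above, held in a string for a self-contained demo.
$stylesheet = <<<'XSL'
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="yes" encoding="utf-8"/>

    <xsl:template match="node()|@*">
        <xsl:copy>
            <xsl:apply-templates select="node()|@*"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
XSL;

// Sample input (an assumption for illustration); character references keep
// this script independent of the encoding of the PHP source file.
$xml = '<?xml version="1.0"?><foo>&#196;&#214;&#220;</foo>';

// Load the stylesheet and hand it to the XSLT processor.
$xsl = new DOMDocument();
$xsl->loadXML($stylesheet);

$processor = new XSLTProcessor();
$processor->importStylesheet($xsl);

// Load the source document.
$source = new DOMDocument();
$source->loadXML($xml);

// transformToXml() serializes the result according to xsl:output,
// so the returned string is encoded as UTF-8.
$result = $processor->transformToXml($source);

echo $result;
```

For a real file, $source->load() with the path to the input and file_put_contents() for the result would replace the in-memory string and the echo.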

Upvotes: 1
