Dave


How to load XML when PHP can't indicate the right encoding?

I'm trying to load an XML source from a remote location, so I have no control over its formatting. Unfortunately, the XML file I'm trying to load has no encoding declaration:

<ROOT xmlns:sql="urn:schemas-microsoft-com:xml-sql"> <NODE> </NODE> </ROOT>

When trying something like:

$doc = new DOMDocument( );
$doc->load(URI);

I get:

Input is not proper UTF-8, indicate encoding ! Bytes: 0xA3 0x38 0x2C 0x38

I've looked at ways to suppress this, but had no luck. How should I load this so that I can use it with DOMDocument?

Upvotes: 2

Views: 14256

Answers (4)

kenorb

Reputation: 166399

You have to convert the document to UTF-8. The easiest way is utf8_encode(), which assumes the input is ISO-8859-1 (note that it is deprecated as of PHP 8.2).

DOMDocument example:

$doc = new DOMDocument();
$content = utf8_encode(file_get_contents($url));
$doc->loadXML($content);

SimpleXML example:

$xmlInput = simplexml_load_string(utf8_encode(file_get_contents($url_or_file)));

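If you don't know the current encoding, run mb_detect_encoding() on the raw content (before any conversion), for example:

$content = file_get_contents($url_or_file);
// Strict detection against a short candidate list; fall back to ISO-8859-1.
$encoding = mb_detect_encoding($content, ['UTF-8', 'ISO-8859-1'], true) ?: 'ISO-8859-1';
$doc = new DOMDocument();
// Prepend a well-formed XML declaration (assumes the source has none of its own).
$res = $doc->loadXML("<?xml version=\"1.0\" encoding=\"$encoding\"?>\n" . $content);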

Notes:

  • If the encoding cannot be detected (mb_detect_encoding() returns false), you can try forcing the conversion with utf8_encode().
  • If you're loading HTML via $doc->loadHTML() instead, you can still prepend an XML header to indicate the encoding.

If you know the encoding, use iconv() to convert it:

$xml = iconv('ISO-8859-1', 'UTF-8', $xmlInput);
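
The convert-then-load steps above can be sketched as a single helper. This is only a minimal sketch: to_utf8_xml() is a hypothetical name, the sample string stands in for the remote document, and ISO-8859-1 is an assumed source encoding (the question does not say what the feed really uses). It relies on the mbstring extension and converts only when the input is not already valid UTF-8, which avoids double-encoding.

```php
<?php
// Hypothetical helper: returns UTF-8 XML text, assuming that any input which
// is not already valid UTF-8 is encoded as $assumedEncoding.
function to_utf8_xml(string $raw, string $assumedEncoding = 'ISO-8859-1'): string
{
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($raw, 'UTF-8', $assumedEncoding);
}

// Sample input standing in for the remote document.
// "\xA3" is the pound sign in ISO-8859-1 (the first byte from the error message).
$raw = "<ROOT><NODE>\xA38,8</NODE></ROOT>";

$doc = new DOMDocument();
$doc->loadXML(to_utf8_xml($raw));

// DOMDocument always hands text back as UTF-8.
echo $doc->getElementsByTagName('NODE')->item(0)->textContent, "\n";
```

In real use, $raw would come from file_get_contents($url) as in the examples above.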

Upvotes: 2

JV-

Reputation: 406

I ran into a similar situation. I was getting an XML file that was supposed to be UTF-8 encoded, but it included some bad ISO characters.

I wrote the following code to encode the bad characters to UTF-8:

<?php

# The XML file with bad characters
$filename = "sample_xml_file.xml";

# Read file contents to a variable
$contents = file_get_contents($filename);

# Re-encode only the bytes that are not part of a UTF-8 multi-byte sequence.
# Well-formed-looking sequences match the first three alternatives and are
# kept as-is; a stray high byte is captured in group 1 and treated as ISO-8859-1.
$contents = preg_replace_callback(
        '/[\xC2-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF4][\x80-\xBF]{3}|([\x80-\xFF])/',
        function ($match) {
                # Group 1 is only set when a stray byte was matched
                return isset($match[1]) ? utf8_encode($match[1]) : $match[0];
        },
        $contents
);

# Write the fixed contents back to the file
file_put_contents($filename, $contents);

# Cleanup
unset($contents);

# Now the bad characters have been encoded to UTF8
# It will now load file with DOMDocument
$dom = new DOMDocument();
$dom->load($filename);

?>

I posted about the solution in more detail at: http://dev.strategystar.net/2012/01/convert-bad-characters-to-utf-8-in-an-xml-file-with-php/

Upvotes: -1

Rushyo

Reputation: 7604

You could pre-process the document by adding an XML declaration that specifies the encoding it is actually delivered in. You'll have to work out what that encoding is yourself, of course. DOMDocument should then parse it.

Example XML declaration:

<?xml version="1.0" encoding="UTF-8" ?>
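
A rough sketch of that pre-processing step in PHP follows. The sample string stands in for the remote document, and ISO-8859-1 is only an assumed encoding, so substitute whatever the feed really uses. The declaration is added only when the document does not already start with one.

```php
<?php
// Sample input standing in for the remote document ("\xA3" is the pound sign in ISO-8859-1).
$raw = "<ROOT><NODE>\xA38,8</NODE></ROOT>";

// Assumption: the feed is ISO-8859-1. Replace with the real encoding if known.
$assumedEncoding = 'ISO-8859-1';

// Prepend a declaration only if the document does not already begin with one.
if (strncmp(ltrim($raw), '<?xml', 5) !== 0) {
    $raw = '<?xml version="1.0" encoding="' . $assumedEncoding . '"?>' . "\n" . $raw;
}

$doc = new DOMDocument();
$loaded = $doc->loadXML($raw);

// Text read back from the DOM is UTF-8, whatever the source encoding was.
echo $doc->getElementsByTagName('NODE')->item(0)->textContent, "\n";
```

Unlike converting the bytes yourself, this leaves the content untouched and lets the parser do the transcoding.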

Upvotes: 1

Steven Surowiec

Reputation: 10220

You can try using the XMLReader class instead. XMLReader is designed specifically for XML, and its open() and XML() methods take an optional encoding argument (null for none).
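
As a minimal sketch of this approach, the snippet below passes an assumed encoding (ISO-8859-1, which the question does not confirm) as the second argument of XMLReader::XML(). The sample string stands in for the remote document; for a URL, XMLReader::open($uri, $encoding) takes the same argument.

```php
<?php
// Sample input standing in for the remote document ("\xA3" is the pound sign in ISO-8859-1).
$raw = "<ROOT><NODE>\xA38,8</NODE></ROOT>";

$reader = new XMLReader();
// Second argument: the encoding to assume for the input (an assumption here).
$reader->XML($raw, 'ISO-8859-1');

// Collect the text of every NODE element as the reader streams through the document.
$values = [];
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'NODE') {
        $values[] = $reader->readString();
    }
}
$reader->close();

print_r($values);
```

If a DOM tree is still needed, XMLReader::expand() can turn the current node into a DOMNode.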

Upvotes: 0
