Reputation:
I'm trying to load an XML source from a remote location, so I have no control over the formatting. Unfortunately, the XML file I'm trying to load has no encoding declaration:
<ROOT xmlns:sql="urn:schemas-microsoft-com:xml-sql"> <NODE> </NODE> </ROOT>
When trying something like:
$doc = new DOMDocument( );
$doc->load(URI);
I get:
Input is not proper UTF-8, indicate encoding ! Bytes: 0xA3 0x38 0x2C 0x38
I've looked at ways to suppress this, but had no luck. How should I load this so that I can use it with DOMDocument?
Upvotes: 2
Views: 14256
Reputation: 166399
You have to convert your document to UTF-8. The easiest way is to use utf8_encode(), which assumes the input is ISO-8859-1.
DOMDocument example:
$doc = new DOMDocument();
$content = utf8_encode(file_get_contents($url));
$doc->loadXML($content);
SimpleXML example:
$xmlInput = simplexml_load_string(utf8_encode(file_get_contents($url_or_file)));
If you don't know the current encoding, use mb_detect_encoding(), for example:
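// Detect on the raw bytes, before any conversion; strict mode rejects invalid UTF-8
$content = file_get_contents($url_or_file);
$encoding = mb_detect_encoding($content, ['UTF-8', 'ISO-8859-1'], true);
$doc = new DOMDocument();
// Prepend a complete XML declaration so the parser knows the encoding
$res = $doc->loadXML("<?xml version='1.0' encoding='$encoding'?>" . $content);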
Notes:
You can use $doc->loadHTML instead; you can still use the XML header.
If you know the encoding, use iconv() to convert it:
$xml = iconv('ISO-8859-1', 'UTF-8', $xmlInput);
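As a minimal sketch of the iconv() route, assuming the source really is ISO-8859-1 (the `0xA3 0x38 0x2C 0x38` bytes from the error message would then read as `£8,8`), the conversion and load could look like this. The inline `$raw` string is a hypothetical stand-in for the downloaded document:

```php
<?php
// Hypothetical stand-in for the remote document: "£8,8" stored as ISO-8859-1,
// so the pound sign is the single byte 0xA3 (an assumption about the source).
$raw = "<ROOT><NODE>\xA38,8</NODE></ROOT>";

// Convert the bytes to UTF-8 before handing them to the parser.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $raw);
if ($utf8 === false) {
    throw new RuntimeException('iconv() could not convert the input');
}

$doc = new DOMDocument();
$doc->loadXML($utf8);

// DOM strings are UTF-8, so the pound sign is now the two bytes 0xC2 0xA3.
$text = $doc->getElementsByTagName('NODE')->item(0)->textContent;
echo bin2hex($text), "\n";
```

If the real encoding turns out to be something else (Windows-1252, for instance), only the first argument to iconv() should need to change.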
Upvotes: 2
Reputation: 406
I ran into a similar situation. I was getting an XML file that was supposed to be UTF-8 encoded, but it included some bad ISO characters.
I wrote the following code to encode the bad characters to UTF-8:
<?php
# The XML file with bad characters
$filename = "sample_xml_file.xml";
# Read file contents to a variable
$contents = file_get_contents($filename);
# Find every run of non-ASCII bytes and fix it in a single pass, so that
# bytes produced by one replacement are never re-encoded by a later one
$contents = preg_replace_callback('/[\x80-\xFF]+/', function ($match) {
    # Keep runs that are already valid UTF-8; treat the rest as ISO-8859-1
    return mb_check_encoding($match[0], 'UTF-8') ? $match[0] : utf8_encode($match[0]);
}, $contents);
# Write the fixed contents back to the file
file_put_contents($filename, $contents);
# Cleanup
unset($contents);
# Now the bad characters have been encoded to UTF8
# It will now load file with DOMDocument
$dom = new DOMDocument();
$dom->load($filename);
?>
I posted about the solution in more detail at: http://dev.strategystar.net/2012/01/convert-bad-characters-to-utf-8-in-an-xml-file-with-php/
Upvotes: -1
Reputation: 7604
You could edit the document ('pre-process it') to specify the encoding it is being delivered in by adding an XML declaration. You'll have to ascertain what that encoding is yourself, of course. The DOM object should then parse it.
Example XML declaration:
<?xml version="1.0" encoding="UTF-8" ?>
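A rough sketch of that pre-processing step, assuming the feed is ISO-8859-1 (a guess; substitute whatever encoding you determine). The `$raw` string is a hypothetical stand-in for the downloaded bytes:

```php
<?php
// Hypothetical stand-in for the downloaded bytes: no XML declaration,
// and a pound sign stored as the single ISO-8859-1 byte 0xA3.
$raw = "<ROOT><NODE>\xA38,8</NODE></ROOT>";

// Only prepend a declaration when the document does not already start with one,
// because a second declaration would make the document ill-formed.
if (strncmp(ltrim($raw), '<?xml', 5) !== 0) {
    $raw = '<?xml version="1.0" encoding="ISO-8859-1"?>' . $raw;
}

$doc = new DOMDocument();
$ok = $doc->loadXML($raw);

// The parser converts to UTF-8 internally, so text read back from the DOM is UTF-8.
$text = $doc->getElementsByTagName('NODE')->item(0)->textContent;
echo bin2hex($text), "\n";
```

The original bytes are left untouched here; the declaration only tells the parser how to interpret them.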
Upvotes: 1
Reputation: 10220
You can try using the XMLReader class instead. XMLReader is designed specifically for XML and lets you specify which encoding to use (including null for none).
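Something along these lines might work, as a sketch only: it assumes the data is ISO-8859-1 and uses a hypothetical inline string in place of the remote file. The second argument to XML() is the encoding override:

```php
<?php
// Hypothetical stand-in for the remote document (ISO-8859-1 is an assumption).
$raw = "<ROOT><NODE>\xA38,8</NODE></ROOT>";

$reader = new XMLReader();
// Second argument tells the reader which encoding the input bytes use.
$reader->XML($raw, 'ISO-8859-1');

$text = null;
while ($reader->read()) {
    // Pick up the text content of the first NODE element encountered.
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'NODE') {
        $text = $reader->readString();
        break;
    }
}
$reader->close();

echo bin2hex((string) $text), "\n";
```

For a remote file, `$reader->open($uri, 'ISO-8859-1')` takes the same encoding argument in place of the XML() call.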
Upvotes: 0