Reputation: 337
We have a script that parses XML feeds from user-generated sources, which from time to time contain improperly formatted entries with special characters.
While I would normally just run utf8_encode() on the line, I'm not certain how to do that here, because XMLReader reads the file progressively and the error is thrown when expand() is called.
Because SimpleXML chokes on the bad entry, the entries that follow it are thrown off as well.
Here's the code:
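$z = new XMLReader;
$z->open($filename);
$doc = new DOMDocument('1.0', 'UTF-8');
// Skip ahead to the first <product> element.
while ($z->read() && $z->name !== 'product');
while ($z->nodeType === XMLReader::ELEMENT && $z->name === 'product') {
    // expand() is the call that fails on the malformed entry.
    $producti = simplexml_import_dom($doc->importNode($z->expand(), true));
    print_r($producti);
    // Advance to the next <product>; without this the loop never moves on.
    $z->next('product');
}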
Errors:
Message: XMLReader::expand(): foo.xml:29081: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0x05 0x20 0x2D 0x35
Severity: Warning
Message: XMLReader::expand(): An Error Occured while expanding
Filename: controllers/feeds.php
Line Number: 106
Message: Argument 1 passed to DOMDocument::importNode() must be an instance of DOMNode, boolean given
Filename: controllers/feeds.php
Line Number: 106
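The last message looks like a consequence of the first two: XMLReader::expand() returns false when expansion fails, and that false is then handed straight to importNode(). A minimal sketch of guarding the call is below. The inline feed, the $names and $skipped variables, and the choice to skip a failed entry are assumptions made for illustration, not part of the original script. Whether the reader can carry on past a malformed entry depends on the error, so this only avoids the type error; it does not repair the input.

```php
<?php
// Hypothetical inline feed standing in for foo.xml.
$xml = '<products>'
     . '<product><name>A</name></product>'
     . '<product><name>B</name></product>'
     . '</products>';

// Collect libxml parser errors instead of emitting them as warnings.
libxml_use_internal_errors(true);

$z = new XMLReader;
$z->XML($xml);
$doc = new DOMDocument('1.0', 'UTF-8');

$names = [];
$skipped = 0;

// Skip ahead to the first <product> element.
while ($z->read() && $z->name !== 'product');

while ($z->nodeType === XMLReader::ELEMENT && $z->name === 'product') {
    $node = $z->expand();
    if ($node === false) {
        // expand() failed: count the entry and do not call importNode().
        $skipped++;
    } else {
        $product = simplexml_import_dom($doc->importNode($node, true));
        $names[] = (string) $product->name;
    }
    $z->next('product');
}
$z->close();

print_r($names);
```

Any parser errors gathered along the way can be inspected with libxml_get_errors() and discarded with libxml_clear_errors().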
Upvotes: 2
Views: 2184
Reputation: 17020
Use the HTML Tidy library first to clean up the string.
I would also use DOMDocument rather than XMLReader.
Something like this:
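// $html is assumed to hold the raw feed contents as a string.
$tidy = new Tidy;
// Note: these options target HTML; for an XML feed, Tidy's 'input-xml'
// and 'output-xml' options are probably the relevant ones.
$config = array(
    'drop-font-tags' => true,
    'drop-proprietary-attributes' => true,
    'hide-comments' => true,
    'indent' => true,
    'logical-emphasis' => true,
    'numeric-entities' => true,
    'output-xhtml' => true,
    'wrap' => 0
);
$tidy->parseString($html, $config, 'utf8');
$tidy->cleanRepair();
$xml = $tidy->value; // the cleaned-up string
$dom = new DOMDocument;
$dom->loadXML($xml);
...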
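If Tidy is not available, a narrower variant of the same "clean the string first" idea is to strip the offending bytes yourself before parsing. The sketch below assumes the problem is control characters such as the 0x05 in the error message, which XML 1.0 does not allow. The function name stripInvalidXmlChars() and the inline sample feed are made up for illustration, and mb_scrub() is only used if the mbstring extension happens to be loaded.

```php
<?php
// Hypothetical helper: removes characters that XML 1.0 forbids.
function stripInvalidXmlChars(string $raw): string
{
    // Drop C0 control bytes (everything below 0x20 except tab, LF, CR).
    // Byte-wise matching is safe here because these byte values never
    // occur inside a multi-byte UTF-8 sequence.
    $clean = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $raw) ?? $raw;

    // If mbstring is loaded, also replace malformed UTF-8 sequences.
    if (function_exists('mb_scrub')) {
        $clean = mb_scrub($clean, 'UTF-8');
    }
    return $clean;
}

// Sample entry containing the same bytes as the error (0x05 0x20 0x2D 0x35).
$raw = "<products><product><name>Widget\x05 -5</name></product></products>";

$z = new XMLReader;
$z->XML(stripInvalidXmlChars($raw));
$doc = new DOMDocument('1.0', 'UTF-8');

$names = [];
while ($z->read() && $z->name !== 'product');
while ($z->nodeType === XMLReader::ELEMENT && $z->name === 'product') {
    $node = $z->expand();
    if ($node !== false) {
        $product = simplexml_import_dom($doc->importNode($node, true));
        $names[] = (string) $product->name;
    }
    $z->next('product');
}
$z->close();

print_r($names);
```

The trade-off is that the whole feed has to be held in memory as a string (for a real file, something like file_get_contents($filename)), which gives up the streaming benefit of XMLReader on very large feeds.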
Upvotes: 1