Ted S

Reputation: 337

Forcing UTF-8 Encoding with PHP's XMLReader, DOM and SimpleXML

We have a script that parses XML feeds from user-generated sources, which from time to time contain improperly formatted entries with special characters.

Normally I would just run utf8_encode() on the offending line, but I'm not sure how to do that here: XMLReader reads the file progressively, and the error is thrown when expand() is called.

Because SimpleXML chokes on the malformed entry, subsequent entries are thrown off as well.

Here's the code.

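$z = new XMLReader;
$z->open($filename);
$doc = new DOMDocument('1.0', 'UTF-8');

// Skip ahead to the first <product> element.
while ($z->read() && $z->name !== 'product');

// Expand each <product> into a SimpleXML object.
while ($z->nodeType == XMLReader::ELEMENT && $z->name === 'product') {
    $producti = simplexml_import_dom($doc->importNode($z->expand(), true)); // expand() fails here on a bad entry
    print_r($producti);
    $z->next('product'); // advance to the next <product>; without this the loop never moves on
}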

Errors:

Message: XMLReader::expand(): foo.xml:29081: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0x05 0x20 0x2D 0x35

Severity: Warning

Message: XMLReader::expand(): An Error Occured while expanding

Filename: controllers/feeds.php

Line Number: 106

Message: Argument 1 passed to DOMDocument::importNode() must be an instance of DOMNode, boolean given

Filename: controllers/feeds.php

Line Number: 106
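
For what it's worth, the third message is a consequence of the second: expand() returns false when it fails, and that boolean goes straight into importNode(). Below is a minimal sketch of guarding against that. It does not fix the encoding itself. The function name read_products() is made up for illustration, it takes the XML as a string rather than a filename to stay self-contained, and it uses libxml_use_internal_errors(true) so that parser errors are collected rather than raised as warnings.

```php
<?php
// Hypothetical helper (not from the original post): collect every <product>
// element that can be expanded, and stop cleanly at the first one that cannot.
function read_products(string $xml): array
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    libxml_use_internal_errors(true);

    $reader = new XMLReader();
    $reader->XML($xml);
    $doc = new DOMDocument('1.0', 'UTF-8');
    $products = [];

    // Skip ahead to the first <product> element.
    while ($reader->read() && $reader->name !== 'product');

    while ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'product') {
        $node = $reader->expand();
        if ($node === false) {
            // expand() failed (for example on badly encoded input), so there
            // is no DOMNode to hand to importNode(); stop here.
            break;
        }
        $products[] = simplexml_import_dom($doc->importNode($node, true));

        // Move to the next <product>; stop if there is none or the parser failed.
        if (!$reader->next('product')) {
            break;
        }
    }

    $reader->close();
    return $products;
}
```

Whatever errors were collected along the way can be inspected afterwards with libxml_get_errors().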

Upvotes: 2

Views: 2184

Answers (1)

s.webbandit

Reputation: 17020

Run the feed through the HTML Tidy library first to clean up the string.

I'd also suggest using DOMDocument instead of XMLReader.

Something like this:

        $tidy = new Tidy;

        // Treat the input as XML rather than HTML, so Tidy does not
        // wrap the feed in <html>/<body> or drop unknown tags.
        $config = array(
                'input-xml' => true,
                'output-xml' => true,
                'numeric-entities' => true,
                'wrap' => 0
        );

        // $html holds the raw feed contents as a string.
        $tidy->parseString($html, $config, 'utf8');
        $tidy->cleanRepair();

        $xml = $tidy->value; // Get the cleaned string

        $dom = new DOMDocument;
        $dom->loadXML($xml);

        ...
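
If the Tidy extension isn't available, a rough alternative is to clean the raw bytes yourself before parsing. This is only a sketch under assumptions: clean_xml_string() is a hypothetical helper, it assumes the mbstring extension is loaded, and it guesses that any line that isn't valid UTF-8 is ISO-8859-1 (the same assumption utf8_encode() makes). It also strips control characters that XML 1.0 forbids, since the first byte reported in the error (0x05) is one of them.

```php
<?php
// Hypothetical helper (not part of the original answer): make a raw feed
// string acceptable to libxml, working line by line.
function clean_xml_string(string $raw, string $fallback = 'ISO-8859-1'): string
{
    $lines = explode("\n", $raw);

    foreach ($lines as $i => $line) {
        // Assumption: a line that is not valid UTF-8 is in $fallback encoding.
        if (!mb_check_encoding($line, 'UTF-8')) {
            $line = mb_convert_encoding($line, 'UTF-8', $fallback);
        }

        // Remove control characters XML 1.0 forbids; tab, LF and CR are kept.
        $lines[$i] = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $line);
    }

    return implode("\n", $lines);
}
```

The cleaned string can then be handed to $dom->loadXML() or to XMLReader's XML() method. Note that this reads the whole feed into memory, so the streaming benefit of XMLReader::open() is lost.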

Upvotes: 1
