IrfanClemson

Reputation: 1779

php: SimpleXML Load File Invalid Character Error

I have a PHP application that sometimes fails, depending on the data I load, with errors like:

parser error : PCDATA invalid Char value 11
Warning: simplexml_load_file(): ath>/datadrivenbestpractices/Data-driven Best Practices in 
Warning: simplexml_load_file(): ^ in 

I am certain that specific values in the data are causing the problem, and I have no control over the data. I have tried the solutions from the following questions, but they have not helped: Error: "Input is not proper UTF-8, indicate encoding !" using PHP's simplexml_load_string; How to handle invalid unicode with simplexml; and How to skip invalid characters in XML file using PHP.

The culprit strings are: 'Data Driven - Best Practices' and 'Data-driven Best Practices to Recruit and Retain Underrepresented Graduate Students May 12, 2011 - 1:30-3:00 p.m., EST' (the cause may be the dashes or the return characters).

What can I do? My test environment is PHP on Windows, but the live environment will be a LAMP stack where I can't touch the .ini files.

Thanks.

Upvotes: 13

Views: 12253

Answers (2)

IrfanClemson

Reputation: 1779

Never mind: the answer in How to skip invalid characters in XML file using PHP did work. Here is my code:

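// Stream filter that strips characters not allowed in XML 1.0 while the data is being read.
class ValidUTF8XMLFilter extends php_user_filter
{
    // Matches runs of anything other than tab, LF, CR, U+0020-U+D7FF and U+E000-U+FFFD.
    protected static $pattern = '/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u';

    public function filter($in, $out, &$consumed, $closing)
    {
        while ($bucket = stream_bucket_make_writeable($in)) {
            // Caveat: with /u, preg_replace() returns null when a chunk is not valid UTF-8,
            // which can happen if a multibyte character is split across two buckets.
            $bucket->data = preg_replace(self::$pattern, '', $bucket->data);
            $consumed += $bucket->datalen;
            stream_bucket_append($out, $bucket);
        }
        return PSFS_PASS_ON;
    }
}

// Register the filter under the name used in the php://filter URL below.
stream_filter_register('xmlutf8', 'ValidUTF8XMLFilter');

$doc = simplexml_load_file("php://filter/read=xmlutf8/resource=" . $serveraddress . $myparam);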

Upvotes: 0

user1997244

Stripping the invalid chars before parsing would be the easiest fix:

// Replace each run of characters not allowed in XML 1.0 with a single space.
function utf8_for_xml($string)
{
    return preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $string);
}

From: PHP generated XML shows invalid Char value 27 message
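
For what it's worth, here is a rough sketch of how that function might be wired into the parsing step. The sample string is made up for illustration (it is not the asker's real data); it puts a vertical tab, which is char value 11, inside a title. The use of libxml_use_internal_errors() to keep the warnings quiet is my own addition, not something from the linked answer.

```php
<?php
// Same pattern as in the answer above. Note that it only allows BMP characters,
// so characters above U+FFFF are replaced as well.
function utf8_for_xml($string)
{
    return preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $string);
}

// Hypothetical sample input: "\x0B" is a vertical tab (char value 11).
$raw = "<items><title>Data Driven\x0B- Best Practices</title></items>";

// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Parsing the raw string fails because of the control character.
$broken = simplexml_load_string($raw);

// Parsing the cleaned string succeeds; the vertical tab has become a space.
$doc = simplexml_load_string(utf8_for_xml($raw));

if ($doc !== false) {
    echo (string) $doc->title, "\n";
}
```

For a file or URL, the same idea should work by reading the content first, e.g. simplexml_load_string(utf8_for_xml(file_get_contents($path))), where $path is whatever you currently pass to simplexml_load_file(). One thing to watch: because of the /u modifier, preg_replace() returns null if the input is not valid UTF-8, so it is probably worth checking for that before parsing.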

Upvotes: 18
