Reputation: 1779
I have a PHP application that sometimes fails, depending on the data I load, with errors like:
parser error : PCDATA invalid Char value 11
Warning: simplexml_load_file(): ath>/datadrivenbestpractices/Data-driven Best Practices in
Warning: simplexml_load_file(): ^ in
I am certain that some values in the data are causing the problem, and I have no control over the data. I have tried the solutions from "Error: 'Input is not proper UTF-8, indicate encoding !' using PHP's simplexml_load_string", "How to handle invalid unicode with simplexml" and "How to skip invalid characters in XML file using PHP", but they have not helped.
The culprit strings are: 'Data Driven - Best Practices' and 'Data-driven Best Practices to Recruit and Retain Underrepresented Graduate Students May 12, 2011 - 1:30-3:00 p.m., EST' (the problem may be the dashes or return characters).
What can I do? My test environment is PHP on Windows, but the live environment will be a LAMP one, and I can't touch the .ini files.
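The "Char value 11" in the parser error is decimal 11, the vertical tab (0x0B), which XML 1.0 does not allow. A minimal sketch for locating such characters before parsing, assuming the document fits in memory as a string; `find_invalid_xml_chars` is a hypothetical helper name, not a PHP built-in, and the sample string is made up for illustration:

```php
<?php
// Hypothetical helper: report the byte offset and character code of each
// ASCII control character that XML 1.0 forbids (everything below 0x20
// except tab, line feed and carriage return).
function find_invalid_xml_chars(string $xml): array
{
    $found = [];
    // No /u modifier: the match is byte-based, and UTF-8 continuation
    // bytes are all >= 0x80, so they can never match this class.
    if (preg_match_all('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', $xml, $matches, PREG_OFFSET_CAPTURE)) {
        foreach ($matches[0] as [$char, $offset]) {
            $found[] = ['offset' => $offset, 'code' => ord($char)];
        }
    }
    return $found;
}

// Made-up sample with a vertical tab between "Data" and "Driven".
$sample = '<t>Data' . chr(11) . 'Driven</t>';
print_r(find_invalid_xml_chars($sample));
```

Running this against the real feed should show whether the offending bytes sit where the dashes or line breaks are.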
Thanks.
Upvotes: 13
Views: 12253
Reputation: 1779
Never mind, the answer in "How to skip invalid characters in XML file using PHP" did work. Here is my code:
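class ValidUTF8XMLFilter extends php_user_filter
{
    // Matches every run of characters outside the ranges kept for XML.
    protected static $pattern = '/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u';

    function filter($in, $out, &$consumed, $closing)
    {
        while ($bucket = stream_bucket_make_writeable($in)) {
            $consumed += $bucket->datalen;
            // preg_replace() returns null if the chunk is not valid UTF-8
            // (e.g. a multibyte character split across buckets); in that
            // case pass the chunk through unchanged instead of dropping it.
            $clean = preg_replace(self::$pattern, '', $bucket->data);
            if ($clean !== null) {
                $bucket->data = $clean;
            }
            stream_bucket_append($out, $bucket);
        }
        return PSFS_PASS_ON;
    }
}

// Register the filter, then read the XML through it.
stream_filter_register('xmlutf8', 'ValidUTF8XMLFilter');
$doc = simplexml_load_file("php://filter/read=xmlutf8/resource=" . $serveraddress . $myparam);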
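One caveat with a stream filter is that data arrives in chunks, so a pattern using the `/u` modifier can fail when a multibyte character is split across two chunks. A hedged variant, assuming only ASCII control characters need to be removed: the byte-level pattern below does not depend on chunk boundaries. The class name `XmlControlCharFilter` and the filter name `xmlctrl` are made up for this sketch, and it runs against an in-memory stream rather than the real resource:

```php
<?php
// Hypothetical variant: strips only the ASCII control characters that
// XML 1.0 forbids, using a byte-level pattern (no /u modifier).
class XmlControlCharFilter extends php_user_filter
{
    public function filter($in, $out, &$consumed, $closing): int
    {
        while ($bucket = stream_bucket_make_writeable($in)) {
            $consumed += $bucket->datalen;
            $bucket->data = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]+/', '', $bucket->data);
            stream_bucket_append($out, $bucket);
        }
        return PSFS_PASS_ON;
    }
}
stream_filter_register('xmlctrl', 'XmlControlCharFilter');

// Exercise the filter on an in-memory stream instead of a remote file.
$stream = fopen('php://memory', 'w+');
fwrite($stream, '<t>Data' . chr(11) . 'Driven</t>');
rewind($stream);
stream_filter_append($stream, 'xmlctrl', STREAM_FILTER_READ);
$clean = stream_get_contents($stream);
fclose($stream);

echo $clean, "\n";
```

Unlike the Unicode-aware pattern, this variant leaves code points such as U+FFFE untouched, so it is a trade-off rather than a drop-in replacement.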
Upvotes: 0
Reputation:
Stripping the invalid chars before parsing would be the easiest fix:
function utf8_for_xml($string)
{
    return preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $string);
}
From: PHP generated XML shows invalid Char value 27 message
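A short usage sketch, assuming the XML is first read into a string and then parsed with simplexml_load_string() instead of simplexml_load_file(). The sample document is made up and stands in for the fetched data; the null check is an addition, since preg_replace() with the `/u` modifier returns null when the input is not valid UTF-8:

```php
<?php
function utf8_for_xml($string)
{
    return preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $string);
}

// Made-up sample standing in for the fetched data; chr(11) is a vertical tab.
$raw = '<?xml version="1.0" encoding="UTF-8"?>'
     . '<items><title>Data' . chr(11) . 'Driven</title></items>';

// Collect parser errors instead of emitting warnings.
libxml_use_internal_errors(true);

$cleaned = utf8_for_xml($raw);
$doc = ($cleaned === null) ? false : simplexml_load_string($cleaned);

if ($doc === false) {
    foreach (libxml_get_errors() as $error) {
        echo trim($error->message), "\n";
    }
} else {
    echo $doc->title, "\n";
}
```

With real data, `$raw` would come from something like file_get_contents() on the same address that was passed to simplexml_load_file().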
Upvotes: 18