Reputation: 529
I think I found a bug in XMLReader::readOuterXML in PHP 5.5.33 and 5.6.19. PHP 5.2.17 is fine; I did not test with PHP 7. My PHP is VC11 x86 Thread Safe, running with Apache 2.4.18 VC11 Win32.
When reading an XML file properly encoded in UTF-8 (with or without a BOM), readOuterXML sometimes generates the warning "Input is not proper UTF-8, indicate encoding !", even though several UTF-8 encoded characters are read before the offending line.
The same file, with some tags or strings removed, passes through without a problem.
This is a simplified version of the function I use to read the XML file:
function TestXML($file) {
    $XR = new XMLReader;
    $XR->open($file, null, LIBXML_NOBLANKS);
    // Advance to the <records> node
    while (($lastRead = $XR->read()) && ($XR->name !== 'records')) {
        ;
    }
    if (!$lastRead) {
        echo $file.' : Invalid file or no records';
        $XR->close();
        return;
    }
    // Advance to the first <record> node
    while (($lastRead = $XR->read()) && ($XR->name !== 'record')) {
        ;
    }
    while ($lastRead) {
        // An empty string here means readOuterXML failed on this node
        $xml = $XR->readOuterXML();
        if ($xml === '') {
            $err = '';
            if ($e = libxml_get_last_error()) {
                $err = $e->message.' (line: '.$e->line.')';
            }
            $XR->close();
            echo $file.' : Problem with file'.($err ? ' — '.$err : '').'.';
            return;
        }
        // Skip to the next <record> sibling
        while (($lastRead = $XR->next()) && ($XR->name !== 'record')) {
            ;
        }
    }
    $XR->close();
    echo $file.' : Good!';
    return;
}
And this is the smallest XML I could produce (without a BOM) that reproduces the problem:
<?xml version="1.0" encoding="utf-8"?>
<records>
<record><aaa><bbbb><ccc><![CDATA[XXX Xxxxxxxxxxxx]]></ccc><ddd><![CDATA[XXX Xx]]></ddd></bbbb><eee><![CDATA[Xxxxx xxxxxxx: xxxx://xxx.xxx.xx.xx/xxxx?xxxxXx=0xx000x0-000x-0xx0-x000-x0000xx0xx00
Xxxxxxxxxxxx xx Xxxxxxxxxxxx Xxxxxxxxx xx Xxxxxxxxx Xxxxxxxxxxxx Xxxxxxxxxxx Xxxxxxxxxxxx (XXX Xxxxxxxxxxxx), xxxxxxxxx xxxxxxx xx Xxx Xxxxxxxxxx Xxxxxxxxxx Xxx.]]></eee></aaa><fff><bbbb><ggg><![CDATA[Xxxxxxxxx Xxxxxxxxxxxxxxx Xxxxxxxxxx xx Xxxxxxxxxxxx]]></ggg><ccc><![CDATA[XXX Xxxxxxxxxxxx]]></ccc></bbbb><hhh><![CDATA[Xx xxxxx, xx xxxxxxxxxxx XXX Xxxxxxxxxxxx x xxxxxcé x’xxxxxxxx xxx x’Xxxxxxléx léxxxxxxxxx xx xx xxxxxxxx xx xx Xxxxxxxxxx Xxxxxxxxxx Xxx (xxx xxx xx xxxxxxxxxx xxxxxxxxx). Xxxxx xxx xréxxxx xxx xxxxxx xxx déxxxxxxxx XXX Xxxxxxxxxxxx xx xxxx xx’xxxxxxxxxxxx xxxxxxxxxxxxxxx xxxxréxxxxxxtéx xx xxxxxxx xxx XX, xxx XXX xx xxx XXX xx xx xxxxxxxx xx xxxxx x’xxxxxxxx xx xxxxx xx xxxxxxxxx xxxxxxxxxxxxx xxréé (XXX). (Xxxxxxxxéx XXX - Xxx 0000)]]></hhh></fff></record>
</records>
Since the problem can disappear with the addition of a couple of spaces (for example, if the above is beautified, it won't cause a problem), I've uploaded the files I used for my tests:
Bad file (without BOM)
Bad file (with BOM and a couple of 'x' removed from the content of the <ggg> tag)
Good file (same as the Bad one, less the <ccc> tag).
You can also remove a couple of 'words' from the Bad file and it will go through.
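To rule out the input itself, the file's encoding can be checked independently of XMLReader. This is only a minimal sketch, assuming the file fits in memory; isValidUtf8File is a hypothetical helper name and is not part of my original function:

```php
<?php
// Hypothetical helper: returns true when the whole file is valid UTF-8.
function isValidUtf8File($file) {
    $data = file_get_contents($file);
    if ($data === false) {
        return false; // file could not be read
    }
    // With the /u modifier, PCRE validates the subject as UTF-8 first;
    // on invalid input preg_match() returns false instead of 1.
    return preg_match('//u', $data) === 1;
}
```

If this returns true for a file that still triggers the warning, the file content is unlikely to be the cause.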
So, is this really a bug in PHP or am I just missing something?
Upvotes: 2
Views: 747
Reputation: 572
I fixed it by installing libxml2-dev with sudo apt-get install libxml2-dev
Upvotes: 0
Reputation: 529
Just to close out this question: as mentioned in my comment, this was a bug in PHP which was recently fixed. As far as I can tell, the affected versions of PHP are 5.5.32, 5.5.33, 5.5.34, 5.5.35, 5.6.18, 5.6.19, 5.6.20 and 5.6.21.
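For anyone who wants to check a server against that list, here is a rough sketch. The helper name isAffectedPhpVersion is hypothetical, and the ranges are simply the versions listed above, not an official statement from the PHP changelog:

```php
<?php
// Hypothetical helper: true when $version falls in one of the ranges
// listed in this answer (5.5.32 to 5.5.35, and 5.6.18 to 5.6.21).
function isAffectedPhpVersion($version) {
    $ranges = array(
        array('5.5.32', '5.5.35'),
        array('5.6.18', '5.6.21'),
    );
    foreach ($ranges as $range) {
        if (version_compare($version, $range[0], '>=')
            && version_compare($version, $range[1], '<=')) {
            return true;
        }
    }
    return false;
}

// Build "major.minor.release" so distribution suffixes in PHP_VERSION
// (for example "-1ubuntu") do not skew the comparison.
$current = PHP_MAJOR_VERSION . '.' . PHP_MINOR_VERSION . '.' . PHP_RELEASE_VERSION;
if (isAffectedPhpVersion($current)) {
    echo 'PHP ' . $current . " is in the affected range\n";
}
```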
Upvotes: 0
Reputation: 21
This is a bug related to libxml2. Upgrade the library to the latest version from this URL: https://git.gnome.org/browse/libxml2/
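Before upgrading, it may help to see which libxml2 version your PHP build reports. A small sketch, assuming the libxml extension is enabled (it is by default); LIBXML_DOTTED_VERSION is the constant PHP exposes for this:

```php
<?php
// Print the PHP version next to the libxml2 version PHP reports.
echo 'PHP ', PHP_VERSION, ', libxml2 ', LIBXML_DOTTED_VERSION, "\n";
```

Comparing this output on a machine where the file fails and on one where it works can show whether the libxml2 version actually differs between them.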
Upvotes: 2