Reputation: 67
I use SimpleXML to parse XML in PHP, but I found that some of the files I need to parse contain errors. Of course, I could manually edit each problematic file, but with 10,000+ files that would take me forever.
Okay, so about the error. When you try to open the XML file in the browser, this message shows up:
Warning: simplexml_load_string(): Entity: line 2: parser error : Specification mandate value for attribute Inspection in ...
I found that the following tag triggers the error (here: Transport instead of Inspection):
<Public Transport Rules>
<PublicTransport id="0">
<Issued>null</Issued>
<Files><localfile>
<location>Citybus</location>
<format>Events</format>
</localfile>
</Files>
</PublicTransport>
</Public Transport Rules>
The spaces within the tag names are causing the issue, apparently. And these tags occur more than once in the file.
I think that SimpleXML parses what it sees in the browser (at face value), so if there is a problem with your XML file, it won't be able to parse it normally. I thought of making PHP read the source file instead, and perhaps editing the file from there. But it seems any fopen call opens what you see in the browser page.
Been stuck with this problem for a while now. Any advice would be appreciated.
Thanks!
Upvotes: 2
Views: 4697
Reputation: 197554
If you can live with a renaming of the tag that has the spaces, tidy is a good option as it works on XML, too:
$xml = simplexml_load_string(
tidy_repair_string($string, ['input-xml' => 1])
);
echo "SimpleXML::asXML():\n", $xml->asXML(), "\n\n";
It renames the tag and creates attributes:
SimpleXML::asXML():
<?xml version="1.0"?>
<Public Transport="" Rules="">
<PublicTransport id="0">
<Issued>null</Issued>
<Files><localfile> <location>Citybus</location>
<format>Events</format> </localfile></Files>
</PublicTransport>
</Public>
There are also more options, for indentation etc. Here is a full example:
<?php
/**
* How to parse XML files with errors using Simplexml in PHP?
*
* @link http://stackoverflow.com/q/15620492/367456
*/
$string = '<?xml version="1.0" ?>
<Public Transport Rules>
<PublicTransport id="0">
<Issued>null</Issued>
<Files><localfile>
<location>Citybus</location>
<format>Events</format>
</localfile>
</Files>
</PublicTransport>
</Public Transport Rules>';
echo "Broken:\n", $string, "\n\n";
$fixed = tidy_repair_string($string, ['input-xml' => 1, 'output-xml' => 1, 'indent' => 1]);
echo "Fixed:\n", $fixed, "\n\n";
$xml = simplexml_load_string(tidy_repair_string($string, ['input-xml' => 1]));
echo "SimpleXML::asXML():\n", $xml->asXML(), "\n\n";
And output:
Broken:
<?xml version="1.0" ?>
<Public Transport Rules>
<PublicTransport id="0">
<Issued>null</Issued>
<Files><localfile>
<location>Citybus</location>
<format>Events</format>
</localfile>
</Files>
</PublicTransport>
</Public Transport Rules>
Fixed:
<?xml version="1.0"?>
<Public Transport="" Rules="">
<PublicTransport id="0">
<Issued>null</Issued>
<Files><localfile>
<location>Citybus</location>
<format>Events</format> </localfile></Files>
</PublicTransport>
</Public>
SimpleXML::asXML():
<?xml version="1.0"?>
<Public Transport="" Rules="">
<PublicTransport id="0">
<Issued>null</Issued>
<Files><localfile> <location>Citybus</location>
<format>Events</format> </localfile></Files>
</PublicTransport>
</Public>
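Since tidy renames tags and rewrites whitespace, it may be worth running it only on files that actually fail to parse. Here is a minimal sketch of such a check, assuming the libxml extension is available; the helper name `xml_parse_errors` is hypothetical and not part of any library:

```php
<?php
/**
 * Hypothetical helper: collects the libxml error messages produced while
 * parsing $xml. Returns an empty array when the string is well-formed.
 */
function xml_parse_errors(string $xml): array
{
    // Collect errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    simplexml_load_string($xml);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Restore the previous error-handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

$errors = xml_parse_errors('<Public Transport Rules></Public Transport Rules>');
foreach ($errors as $message) {
    echo $message, "\n";
}
```

Files for which the helper returns an empty array can go straight to simplexml_load_string(); only the rest would need tidy_repair_string().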
Upvotes: 2
Reputation: 146340
The HTML loader of the DOM extension (DOMDocument::loadHTML()) is designed to deal with invalid markup, so you can give it a try:
<?php
$string = '<?xml version="1.0" ?>
<Public Transport Rules>
<PublicTransport id="0">
<Issued>null</Issued>
<Files><localfile>
<location>Citybus</location>
<format>Events</format>
</localfile>
</Files>
</PublicTransport>
</Public>';
$dom = new DOMDocument;
libxml_use_internal_errors(TRUE);
$dom->loadHTML($string);
libxml_use_internal_errors(FALSE);
$dom->formatOutput = TRUE;
echo '::: Original XML :::' . PHP_EOL;
echo $string . PHP_EOL;
echo PHP_EOL;
echo '::: Fixed XML :::' . PHP_EOL;
if( version_compare(PHP_VERSION, '5.3.6', '>=') ){
    $body = $dom->documentElement->firstChild;
    if( $body->hasChildNodes() ){
        foreach($body->childNodes as $node){
            echo $dom->saveHTML($node);
        }
    }
}else{
    $body = $dom->getElementsByTagName('body')->item(0);
    if( $body->hasChildNodes() ){
        foreach($body->childNodes as $node){
            echo $dom->saveHTML($node);
        }
    }
}
echo PHP_EOL;
... prints this:
::: Original XML :::
<?xml version="1.0" ?>
<Public Transport Rules>
<PublicTransport id="0">
<Issued>null</Issued>
<Files><localfile>
<location>Citybus</location>
<format>Events</format>
</localfile>
</Files>
</PublicTransport>
</Public>
::: Fixed XML :::
<public transport rules><publictransport id="0"><issued>null</issued><files><localfile>
<location>Citybus</location>
<format>Events</format>
</localfile>
</files></publictransport></public>
There's no way to know what will be lost in the process (note that the tag names are lowercased, for instance), but we're dealing with invalid data in the first place.
In any case, you can always edit each problematic file automatically using PHP. Your files may not be XML, but they are still strings ;-)
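As an illustration of the string-editing idea, here is a rough sketch that joins the words of tag names containing spaces before handing the result to SimpleXML. The function name `join_spaced_tag_names` and the choice to simply concatenate the words are assumptions made for this example. The regular expression only looks at tag-like text, so it would also rewrite matching text inside comments or CDATA sections:

```php
<?php
/**
 * Hypothetical helper: joins the words of a tag name that contains spaces,
 * e.g. <Public Transport Rules> becomes <PublicTransportRules>.
 * Only tags made up of bare words (no "=") are touched, so a tag with
 * real attributes such as <PublicTransport id="0"> is left alone.
 */
function join_spaced_tag_names(string $xml): string
{
    return preg_replace_callback(
        '~<(/?)([A-Za-z_][\w.-]*(?:\s+[A-Za-z_][\w.-]*)+)\s*>~',
        function (array $m): string {
            // $m[1] is "/" for a closing tag, $m[2] is the spaced name.
            return '<' . $m[1] . preg_replace('~\s+~', '', $m[2]) . '>';
        },
        $xml
    );
}

$broken = '<?xml version="1.0" ?>
<Public Transport Rules>
<PublicTransport id="0">
<Issued>null</Issued>
</PublicTransport>
</Public Transport Rules>';

$xml = simplexml_load_string(join_spaced_tag_names($broken));

echo $xml->getName(), "\n";                 // name of the repaired root element
echo $xml->PublicTransport->Issued, "\n";   // content of <Issued>
```

For a large batch, the same function could be applied to the result of file_get_contents() for each file in a loop. It is worth keeping the originals untouched until the repaired output has been checked.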
Upvotes: 1