user11092881

Deleting multiple namespaces temporarily without saving to file in PHP?

The following code doesn't work, mainly because of the namespaces declared on the root element of the file I am trying to parse. I would like to remove the XML namespaces temporarily, in memory, without saving the changes to the file.

$fxml = "{$this->path}/input.xml";

if (file_exists($fxml))  {    
  $xml = simplexml_load_file($fxml);
  $fs = fopen("{$this->path}/output.csv", 'w');
  $xml->registerXPathNamespace('e', 'http://www.sitemaps.org/schemas/sitemap/0.9');
  $fieldDefs = [
      'url'                => 'url',
      'id'                 => 'id',
  ];
  fputcsv($fs, array_keys($fieldDefs));
  foreach ($xml->xpath('//e:urlset') as $url) {

      $fields = [];
      foreach ($fieldDefs as $fieldDef) {
          $fields[] = $url->xpath('e:'. $fieldDef)[0];
      }
      fputcsv($fs, $fields);      
      fclose($fs);  
  }
}

This script fails and produces an empty CSV for the following XML. It doesn't work when I have one namespace declared on the root element.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 <url>
  <loc>https://www.mywebsite.com/id/2111</loc>
  <id>903660</id>
 </url>
 <url>
  <loc>https://www.mywebsite.com/id/211</loc>
  <id>911121</id>
 </url>
</urlset>

The issue is that I have two namespaces declared on the root element, as in the XML below. Is there a way to remove the namespaces to make processing simpler?


<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml">
 <url>
  <loc>https://www.mywebsite.com/id/2111</loc>
  <id>903660</id>
 </url>
 <url>
  <loc>https://www.mywebsite.com/id/211</loc>
  <id>911121</id>
 </url>
</urlset>

Upvotes: 1

Views: 43

Answers (2)

Parfait

Reputation: 107652

You need to call registerXPathNamespace on every SimpleXMLElement on which you run xpath(), not only on the root. However, consider a simpler approach: skip the bookkeeping of the $fields array and cast each element returned by XPath directly to a plain array:

// LOAD XML
$xml = simplexml_load_file($fxml);
  
// OUTER PARSE XML
$xml->registerXPathNamespace('e', 'http://www.sitemaps.org/schemas/sitemap/0.9');
$urls = $xml->xpath('//e:url');

// INITIALIZE CSV
$fs = fopen('output.csv', 'w');

// WRITE HEADERS
$headers = array_keys((array)$urls[0]);
fputcsv($fs, $headers);

// INNER PARSE XML
foreach($urls as $url) {
   // WRITE ROWS
   fputcsv($fs, (array)$url); 
}

fclose($fs);
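
To see what the cast produces, here is a minimal sketch run against the first sample XML from the question, inlined as a string. The function name sitemapRows is made up for illustration and is not part of the answer above.

```php
<?php
// Hypothetical helper: returns one plain array per <url> element.
function sitemapRows(string $xmlString): array
{
    $xml = simplexml_load_string($xmlString);
    $xml->registerXPathNamespace('e', 'http://www.sitemaps.org/schemas/sitemap/0.9');

    $rows = [];
    foreach ($xml->xpath('//e:url') as $url) {
        // The cast turns the child elements into name => text pairs.
        $rows[] = (array)$url;
    }
    return $rows;
}

$sample = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 <url>
  <loc>https://www.mywebsite.com/id/2111</loc>
  <id>903660</id>
 </url>
 <url>
  <loc>https://www.mywebsite.com/id/211</loc>
  <id>911121</id>
 </url>
</urlset>
XML;

$rows = sitemapRows($sample);
// $rows[0] is ['loc' => 'https://www.mywebsite.com/id/2111', 'id' => '903660']
```

Note that the CSV header then comes from the element names (loc, id) rather than from a custom mapping. As far as I know, the cast only picks up unprefixed child elements, so prefixed children such as xhtml:link would likely be missing from the row.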

Upvotes: 0

ThW

Reputation: 19502

You would need to delete the namespace definitions and prefixes before loading the XML. This would change the meaning of the nodes and possibly break the XML. However, it is not needed.
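
If the goal is only to ignore namespaces while querying, one option that leaves the document untouched is to match on local-name() in the XPath expressions. This is a different technique from the ones described in this answer, shown here only as a rough sketch. The function name rowsIgnoringNamespaces is hypothetical.

```php
<?php
// Hypothetical helper: extracts url/id pairs without registering any namespace.
function rowsIgnoringNamespaces(string $xmlString): array
{
    $urlset = simplexml_load_string($xmlString);

    $rows = [];
    // local-name() compares the element name without its namespace.
    foreach ($urlset->xpath('//*[local-name()="url"]') as $url) {
        $rows[] = [
            'url' => (string)($url->xpath('*[local-name()="loc"]')[0] ?? ''),
            'id'  => (string)($url->xpath('*[local-name()="id"]')[0] ?? ''),
        ];
    }
    return $rows;
}

$sample = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml">
 <url>
  <loc>https://www.mywebsite.com/id/2111</loc>
  <id>903660</id>
 </url>
 <url>
  <loc>https://www.mywebsite.com/id/211</loc>
  <id>911121</id>
 </url>
</urlset>
XML;

$rows = rowsIgnoringNamespaces($sample);
```

The trade-off is precision: local-name() matches an element called url in any namespace, so registering the namespaces is the more exact choice when several vocabularies share element names.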

The problem with SimpleXMLElement is that you need to re-register the namespaces on every instance you want to call xpath() on. Put that part in a small helper class and you're fine:

class SimpleXMLNamespaces {
    // prefix => namespace URI
    private $_namespaces;

    public function __construct(array $namespaces) {
        $this->_namespaces = $namespaces;
    }

    // register all stored namespaces on the given element
    public function registerOn(SimpleXMLElement $target) {
        foreach ($this->_namespaces as $prefix => $uri) {
            $target->registerXPathNamespace($prefix, $uri);
        }
    }
}

You already have a mapping array for the field definitions. Put the full XPath expression for each field into it:

$xmlns = new SimpleXMLNamespaces(
    [
      'sitemap' => 'http://www.sitemaps.org/schemas/sitemap/0.9',
      'xhtml' => 'http://www.w3.org/1999/xhtml',
    ]
);
// $xml holds the XML document as a string
$urlset = new SimpleXMLElement($xml);
$xmlns->registerOn($urlset);

$columns = [
  'url' => 'sitemap:loc',
  'id' => 'sitemap:id',
];
$fs = fopen("php://stdout", 'w');
fputcsv($fs, array_keys($columns));

foreach ($urlset->xpath('//sitemap:url') as $url) {
    $xmlns->registerOn($url);
    
    $row = [];
    foreach ($columns as $expression) {
       $row[] = (string)($url->xpath($expression)[0] ?? '');
    }
    fputcsv($fs, $row);
}

Output:

url,id
https://www.mywebsite.com/id/2111,903660
https://www.mywebsite.com/id/211,911121

Or use DOM. DOM has a separate class, DOMXPath, whose object stores the namespace registrations, so re-registering is not needed. Additionally, DOMXPath::evaluate() allows XPath expressions that return scalar values directly.

// bootstrap DOM + XPath ($xml holds the XML document as a string)
$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);
$xpath->registerNamespace('sitemap', 'http://www.sitemaps.org/schemas/sitemap/0.9');
$xpath->registerNamespace('xhtml', 'http://www.w3.org/1999/xhtml');

// include string cast in the Xpath expression
// it will return an empty string if it doesn't match
$columns = [
    'url' => 'string(sitemap:loc)',
    'id' => 'string(sitemap:id)',
];

$fs = fopen("php://stdout", 'w');
fputcsv($fs, array_keys($columns));

// iterate the url elements
foreach ($xpath->evaluate('//sitemap:url') as $url) {
    $row = [];
    foreach ($columns as $expression) {
        // evaluate xpath expression for column
        $row[] = $xpath->evaluate($expression, $url);
    }
    fputcsv($fs, $row);
}
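
To illustrate the scalar results mentioned above, here is a small sketch using the sample XML from the question, inlined as a string. The lookup of sitemap:lastmod is only there to show what a non-matching expression returns; that element is not present in the sample.

```php
<?php
$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml">
 <url>
  <loc>https://www.mywebsite.com/id/2111</loc>
  <id>903660</id>
 </url>
 <url>
  <loc>https://www.mywebsite.com/id/211</loc>
  <id>911121</id>
 </url>
</urlset>
XML;

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);
$xpath->registerNamespace('sitemap', 'http://www.sitemaps.org/schemas/sitemap/0.9');

// count() comes back as a float, not as a node list
$urlCount = $xpath->evaluate('count(//sitemap:url)');

// a location path still returns a DOMNodeList
$firstUrl = $xpath->evaluate('//sitemap:url')->item(0);

// string() and boolean() are evaluated relative to the context node
$firstId = $xpath->evaluate('string(sitemap:id)', $firstUrl);
$missing = $xpath->evaluate('string(sitemap:lastmod)', $firstUrl);
$hasId   = $xpath->evaluate('boolean(sitemap:id)', $firstUrl);
```

Because string() yields an empty string for a missing node, the row-building loop needs no null checks or casts.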

Sitemaps are typically large. To avoid the memory consumption of loading the whole document, you can use XMLReader+DOM.

// define a list of used namespaces
$xmlns = [
    'sitemap' => 'http://www.sitemaps.org/schemas/sitemap/0.9',
    'xhtml' => 'http://www.w3.org/1999/xhtml'
];

// create a DOM document for node expansion + xpath expressions
$document = new DOMDocument();
$xpath = new DOMXpath($document);
foreach ($xmlns as $prefix => $namespaceURI) {
    $xpath->registerNamespace($prefix, $namespaceURI);
}

// open the XML for reading ($xmlUri is the path or URL of the sitemap)
$reader = new XMLReader();
$reader->open($xmlUri);

// go to the first url element in the sitemap namespace
while (
    $reader->read() &&
    (
        $reader->localName !== 'url' ||
        $reader->namespaceURI !== $xmlns['sitemap']
    )
) {
    continue;
}

$columns = [
    'url' => 'string(sitemap:loc)',
    'id' => 'string(sitemap:id)',
];

$fs = fopen("php://stdout", 'w');
fputcsv($fs, array_keys($columns));

// check the current node is an url
while ($reader->localName === 'url') {
    // in the sitemap namespace
    if ($reader->namespaceURI === $xmlns['sitemap']) {
        // expand node to DOM for Xpath
        $url = $reader->expand($document);
        $row = [];
        foreach ($columns as $expression) {
            // evaluate xpath expression for column
            $row[] = $xpath->evaluate($expression, $url);
        }
        fputcsv($fs, $row);
    }
    // goto next url sibling node
    $reader->next('url');
}
$reader->close();
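
The same streaming loop can be wrapped in a generator so the caller decides what to do with each row. This is only a sketch under a few assumptions: the function name iterateSitemapUrls is hypothetical, and the XML is passed in as a string to keep the example self-contained (for a real file you would call open() as above).

```php
<?php
// Hypothetical helper: yields one ['url' => ..., 'id' => ...] row per url element.
function iterateSitemapUrls(string $xmlString): Generator
{
    $ns = 'http://www.sitemaps.org/schemas/sitemap/0.9';

    // DOM document used only for node expansion and XPath evaluation
    $document = new DOMDocument();
    $xpath = new DOMXPath($document);
    $xpath->registerNamespace('sitemap', $ns);

    $reader = new XMLReader();
    $reader->XML($xmlString);

    // advance to the first url element in the sitemap namespace
    $onUrl = false;
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT
            && $reader->localName === 'url'
            && $reader->namespaceURI === $ns) {
            $onUrl = true;
            break;
        }
    }

    while ($onUrl) {
        if ($reader->namespaceURI === $ns) {
            // expand only the current url element into DOM
            $node = $reader->expand($document);
            yield [
                'url' => $xpath->evaluate('string(sitemap:loc)', $node),
                'id'  => $xpath->evaluate('string(sitemap:id)', $node),
            ];
        }
        // skip to the next url node, false once there is none left
        $onUrl = $reader->next('url');
    }
    $reader->close();
}

$sample = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml">
 <url>
  <loc>https://www.mywebsite.com/id/2111</loc>
  <id>903660</id>
 </url>
 <url>
  <loc>https://www.mywebsite.com/id/211</loc>
  <id>911121</id>
 </url>
</urlset>
XML;

$streamed = [];
foreach (iterateSitemapUrls($sample) as $row) {
    $streamed[] = $row; // in practice: fputcsv($fs, $row);
}
```

Only one url element is expanded into DOM at a time, so memory use stays roughly constant regardless of the sitemap size.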

Upvotes: 0
