ghosts in my DOMDocument?

Question

A simple XPath query that used to work now returns only empty results.

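Source: any XML file. Suppose

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>test</title>
  </head>
  <body>
    <p>Hello</p>
    <p>Bye</p>
  </body>
</html>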

I rewrote everything from scratch; here is a complete test:

$dom2 = new DOMDocument;
$dom2->Load($pathFile);
$xpath2 = new DOMXPath($dom2);
$entries = $xpath2->query('//p');
// nothing here, all empty:
var_dump($entries);  // zero!
foreach ($entries as $entry) {
    echo "Found {$entry->nodeValue},";
}
// but everything shows up here!
foreach ($dom2->getElementsByTagName('*') as $e) {
    print "\n name={$e->nodeName}";  // all tags!
}

What is wrong? Why is the XPath query not matching anything?

Peter Krauss · Accepted Answer

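It is an old problem with the W3C's DOM and XPath 1.0 standards: an unprefixed name in an XPath expression only matches elements that are in no namespace, so a default xmlns on the root hides every tag from a query like //p. As an old site commented about this surprise for XPath beginners,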

One of the commonly asked questions about (...) is:
"Why nothing matched for my XPath expression which seems right to me?"
Common cause of these problems is not properly defining a namespace for XPath.

But beginners are right: this is ugly behaviour for a "default thing". So let's preserve the beginner's good intuition about what is simple and good.

It is horrible to have to write an XPath that does not look like the XML you are reading, whose tags have no prefix. Simple tags deserve a simple XPath.
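For reference, the conventional fix the quote alludes to is to bind a prefix to the namespace with DOMXPath::registerNamespace() and use that prefix in the query. A minimal sketch, assuming a small XHTML sample modeled on the question (the sample string and the prefix "x" are illustrative choices, not part of the original):

```php
<?php
// Sample document: an assumption modeled on the question's XHTML input.
$xml = '<html xmlns="http://www.w3.org/1999/xhtml">'
     . '<body><p>Hello</p><p>Bye</p></body></html>';

$dom = new DOMDocument;
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

// Unprefixed names in XPath 1.0 match only elements in no namespace.
$plain = $xpath->query('//p');

// Bind an arbitrary prefix ("x" is our choice) to the document's namespace.
$xpath->registerNamespace('x', 'http://www.w3.org/1999/xhtml');
$prefixed = $xpath->query('//x:p');

echo $plain->length, "\n";     // 0
echo $prefixed->length, "\n";  // 2
```

It works, but the query no longer looks like the tags in the file, which is exactly the complaint above.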

Reliable workaround

The best fix for this ugly XPath-query behaviour is to drop the default namespace. It is not trivial, because the root's xmlns attribute is read-only, so we need to rebuild the DOM object from a new XML string:

$expTag = 'html';  // config: expected root tag
$expNs  = 'http://www.w3.org/1999/xhtml';  // config: expected default namespace
// ...
$e = $dom->documentElement; // root node

// Validate the input (against the expected configs) and rewrite the root tag:
if ($e->nodeName == $expTag && $e->hasAttribute('xmlns')
    && $e->getAttribute('xmlns') == $expNs) {
  // can't do $e->removeAttribute('xmlns') because it is read-only!
  $xml = $dom->C14N(); // normalize quotes and remove repeated declarations
  // drop the default-namespace declaration from the root start tag
  $xml = preg_replace("#^<$expTag (.*?)xmlns=\"[^\"]+\"#", "<$expTag\$1", $xml);
  $dom = new DOMDocument; // loadXML() is an instance method, not a static one
  $dom->loadXML($xml);
} else {
  die("\n ERROR: something not expected.\n");
}
//...
$xpath = new DOMXPath($dom);
$entries = $xpath->query('//p'); // perfect, XPath is back to a simple expression!

This solution should be used only when you have no performance constraints, as in digital preservation contexts.

The problem in other practical contexts is the high CPU cost of saving and reloading the full XML as a string and, to be safe, the even more expensive C14N method, which prepares the XML so that the regular expression can be applied safely.
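To see what C14N buys the regular expression, here is a small sketch with a hypothetical input that uses single quotes and places xmlns after another attribute. Canonicalization rewrites the start tag with double quotes and with the namespace declaration first, which is the shape the pattern expects:

```php
<?php
// Hypothetical input: single quotes, and xmlns written after another attribute.
$raw = "<html lang='en' xmlns='http://www.w3.org/1999/xhtml'><p>Hello</p></html>";

$dom = new DOMDocument;
$dom->loadXML($raw);

// Canonical form: double-quoted values, namespace declarations before attributes.
$canonical = $dom->C14N();
echo $canonical, "\n";
// <html xmlns="http://www.w3.org/1999/xhtml" lang="en"><p>Hello</p></html>
```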

The use of C14N (also good for other things in a digital preservation context) is necessary to ensure the correct behaviour of the regular expression. Strictly, the getAttribute() method may be affected by attribute duplication, but we can neglect this "second order" effect, or transfer the check to the regular expression.
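When the save/reload cost discussed above is a concern, one commonly used alternative (a different technique, not the workaround of this answer) is to leave the DOM untouched and match on local-name(), which ignores the namespace. A minimal sketch, again assuming an illustrative XHTML sample:

```php
<?php
// Sample document: an assumption modeled on the question's XHTML input.
$xml = '<html xmlns="http://www.w3.org/1999/xhtml">'
     . '<body><p>Hello</p><p>Bye</p></body></html>';

$dom = new DOMDocument;
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

// local-name() compares the tag name without its namespace, so no
// prefix registration and no re-serialization are needed.
$entries = $xpath->query("//*[local-name()='p']");

$found = [];
foreach ($entries as $entry) {
    $found[] = $entry->nodeValue;
}
echo implode(',', $found), "\n"; // Hello,Bye
```

The trade-off is that the expression is more verbose than a plain //p, so it does not fully restore the simple XPath the workaround aims for.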
