Adam Ranganathan

Reputation: 1717

How to select the text() immediately following an element conditionally in XPath?

I have the following structure where the child nodes are in random order:

<span id="outer">
     <div style="color:blue">51</div>
     <span class="main">Gill</span>$500
     <span style="color:red">11</span>
     <span></span>James
     <div style="color:red">158</div>
     <div class="sub">Mary</div>
</span>

I am trying to concatenate strings together (leaving a space in between) based on conditions:

  1. If style color is "blue" then add node value to string
  2. If class is "main" then add node value to string
  3. All text() not enclosed in tags will be added to string but in the order of traversal of all the child nodes.

The example output for the above structure should be:

51 Gill $500 James

I have written the following PHP to traverse the elements. Feel free to skim it; the main focus is the $expression, which should select a text() node only if it occurs immediately after an element:

$nodes = $xpath->query("//span[@id='outer']/*");
$str_out = "";
foreach($nodes as $node)
{
    if($node->hasAttribute('class'))
    {
        if($node->getAttribute('class')=="main")
            $str_out .= $node->nodeValue . " ";
    }

    else if($node->hasAttribute('style'))
    {
        $node_style = $node->getAttribute('style');
        preg_match('~color:(.*)~', $node_style, $temp);
        if( $temp[1] == "blue" )
            $str_out .= $node->nodeValue . " ";
    }

    // Now evaluate if the IMMEDIATELY next sibling is text()

    $next_node = $xpath->query('.//following-sibling::*[1]', $node);        
    if($next_node->length)
    {
        $next_node = $next_node->item(0);
        $next_node_name = $next_node->nodeName;         
        $next_node_value =  $next_node->nodeValue;
        $current_node_name = $node->nodeName;

        $expression = ".//following-sibling::text()[1][preceding-sibling::".$current_node_name." and following-sibling::".$next_node_name."[contains(text(),'".$next_node_value."')]]";

        $text_node = $xpath->query($expression, $node);
        if($text_node->length)              
        {           
            $str_out .= $text_node->item(0)->nodeValue . " ";               
        }
    }
}
echo $str_out;

The main focus, as mentioned earlier, is to capture the text() node value if it occurs immediately after an element. I want to write an XPath expression that does the following:

  1. Select the first text() node after an element
  2. Check that this text() node lies between the current node and the immediately following element

For example in this block:

<span></span>James
<div style="color:red">158</div>

James is in between the span and div nodes. So we add it to the string.

But in this block:

<span style="color:red">11</span>
<span></span>James
<div style="color:red">158</div>

James would still be selected by the following-sibling::text()[1] step relative to the first span element (the one with color:red), even though another element sits in between.

This should NOT be added.

Please see my $expression in the PHP code where I am trying to capture this process but it is not working.

$expression = ".//following-sibling::text()[1][preceding-sibling::".$current_node_name." and following-sibling::".$next_node_name."[contains(text(),'".$next_node_value."')]]";

Upvotes: 1

Views: 2669

Answers (2)

ThW

Reputation: 19502

XPath supports axes. With them you can specify which nodes are matched initially. The default axis is child, and @ is short for the attribute axis. The axes you need in this case are following-sibling and self.

If you're using span[@class = "main"] to specify the marker node, you can extend it to span[@class = "main"]/following-sibling::node()[1] to fetch the node that follows it. To make sure that this node is a text node, append self::text(): span[@class = "main"]/following-sibling::node()[1]/self::text()
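As a rough sketch of why the node()[1]/self::text() form matters, the snippet below compares it with following-sibling::text()[1] on the markup from the question. The helper textAfter() is made up for this illustration, and the [normalize-space()] predicate is an addition here, assumed to be needed because the indentation between elements produces whitespace-only text nodes.

```php
<?php
// Markup from the question; a nowdoc keeps "$500" literal.
$xml = <<<'XML'
<span id="outer">
     <div style="color:blue">51</div>
     <span class="main">Gill</span>$500
     <span style="color:red">11</span>
     <span></span>James
     <div style="color:red">158</div>
     <div class="sub">Mary</div>
</span>
XML;

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);

// Hypothetical helper for this sketch: the string value of $step evaluated
// relative to the element selected by $start, with surrounding whitespace trimmed.
function textAfter(DOMXPath $xpath, string $start, string $step): string
{
    return trim($xpath->evaluate('string(' . $start . '/' . $step . ')'));
}

$redSpan   = '//span[@style = "color:red"]';
$emptySpan = '//span[not(@class or @style or @id)]';

// First non-blank text node anywhere among the following siblings:
// this skips over the empty <span> and still reaches "James".
$loose = textAfter($xpath, $redSpan, 'following-sibling::text()[normalize-space()][1]');

// Only the very next node, and only if it is a non-blank text node:
// after the red span the next node is just indentation, so nothing matches.
$strict = textAfter($xpath, $redSpan, 'following-sibling::node()[1]/self::text()[normalize-space()]');

// The same strict step from the empty <span>, which is directly followed by "James".
$adjacent = textAfter($xpath, $emptySpan, 'following-sibling::node()[1]/self::text()[normalize-space()]');

var_dump($loose, $strict, $adjacent);
// string(5) "James"
// string(0) ""
// string(5) "James"
```

Only the strict form ties the text to the element it directly follows, which is the behaviour the question asks for.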

At the moment you're iterating over all nodes, but apart from the style attributes you can match the values directly in XPath. For the style conditions you can use a callback into PHP:

$xml = <<<'XML'
<span id="outer">
     <div style="color:blue">51</div>
     <span class="main">Gill</span>$500
     <span style="color:red">11</span>
     <span></span>James
     <div style="color:red">158</div>
     <div class="sub">Mary</div>
</span>
XML;

function getStyleProperty($node, $name) { 
  if (is_array($node)) {
    $node = $node[0];
  }
  if ($node instanceof DOMElement) {
    $pattern = sprintf(
    '(\b%s:\s*([^;]*)\s*(;|$))', preg_quote($name)
    );
    if (preg_match($pattern, $node->getAttribute('style'), $matches)) {
      return $matches[1];
    }
  }
  return '';
}

$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXpath($document);
$xpath->registerNamespace('php', 'http://php.net/xpath');
$xpath->registerPHPFunctions(['getStyleProperty']);

foreach ($xpath->evaluate('//span[@id="outer"]') as $outer) {
  var_dump(
    $xpath->evaluate('string(div[php:function("getStyleProperty", ., "color") = "blue"])', $outer),
    $xpath->evaluate('string(span[@class = "main"])', $outer),
    $xpath->evaluate('string(span[@class = "main"]/following-sibling::text()[1])', $outer),
    $xpath->evaluate('string(span[not(@class or @style)]/following-sibling::node()[1]/self::text())', $outer)
  );
}

Output:

string(2) "51"
string(4) "Gill"
string(10) "$500
     "
string(11) "James
     "

Upvotes: 0

Keith Hall

Reputation: 16095

You can achieve this with the following:

<?php
$xmldoc = new DOMDocument();
$xmldoc->loadXML(<<<XML
<span id="outer">
     <div style="color:blue">51</div>
     <span class="main">Gill</span>$500
     <span style="color:red">11</span>
     <span></span>James
     <div style="color:red">158</div>
     <div class="sub">Mary</div>
</span>
XML
);
$xpath = new DOMXPath($xmldoc);

$nodes = $xpath->query("//span[@id='outer']/*");
$str_out = "";
foreach ($nodes as $node)
{
    if ($node->hasAttribute('class'))
    {
        if ($node->getAttribute('class') == "main")
            $str_out .= $node->nodeValue . " ";
    }

    else if ($node->hasAttribute('style'))
    {
        $node_style = $node->getAttribute('style');
        // Only read $temp[1] when the pattern matched; stop the value at the next ";"
        if (preg_match('~color:\s*([^;]*)~', $node_style, $temp) && trim($temp[1]) == "blue")
            $str_out .= $node->nodeValue . " ";
    }

    // Now evaluate if the IMMEDIATELY next sibling is text()
    $next_node = $xpath->query('./following-sibling::node()[1]/self::text()[normalize-space()]', $node);
    if ($next_node->length)
    {
        $str_out .= trim($next_node->item(0)->nodeValue) . " ";
    }
}
echo $str_out;

The XPath query:

./following-sibling::node()[1]/self::text()[normalize-space()]

says:

  • . from the context node
  • following-sibling::node()[1] take the first following sibling node (whether it be a text node or an element (or even a comment))
  • self::text()[normalize-space()] take the "current" node if it is a text node and doesn't consist of only whitespace

Output is:

51 Gill $500 James

This also handles a text node that follows the last child element of the parent <span id="outer">.
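To illustrate that last point, here is a small sketch with a reduced, made-up variant of the markup in which a text node ("tail", invented for this example) follows the last child element. It reuses the same query as above and collects only the text nodes.

```php
<?php
// Reduced, hypothetical variant of the question's markup: "tail" follows the last element.
// Single quotes keep "$500" literal.
$document = new DOMDocument();
$document->loadXML('<span id="outer"><span class="main">Gill</span>$500<div class="sub">Mary</div>tail</span>');
$xpath = new DOMXPath($document);

$collected = [];
foreach ($xpath->query('//span[@id="outer"]/*') as $element) {
    // Same step as in the answer: the very next node, if it is a non-blank text node.
    $text = $xpath->query('./following-sibling::node()[1]/self::text()[normalize-space()]', $element);
    if ($text->length) {
        $collected[] = trim($text->item(0)->nodeValue);
    }
}

$result = implode(' ', $collected);
echo $result, PHP_EOL; // $500 tail
```

Because the lookup always starts from an element, a text node that precedes the first child element would not be collected this way.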

Upvotes: 0
