Find all descendant text() nodes except in subsections

Question

My XML document has arbitrarily nested sections. Given a reference to a particular section, I need to find all the text nodes in that section, not including those in its subsections.

For example, given a reference to the #a1 node below, I need to find only the "A1 " and "A1" text nodes:

In case it wasn't obvious, the above is made-up data. The id attributes in particular may not exist in the real-world document.

The best I've come up with for now is to find all text nodes within the section and then use Ruby to subtract out the ones I don't want:

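require 'nokogiri'

# All text nodes under the node, minus those inside any nested section.
def own_text(node)
  node.xpath('.//text()') - node.xpath('.//section//text()')
end

# mydoc holds the XML source; noblanks drops whitespace-only text nodes.
doc = Nokogiri.XML(mydoc, &:noblanks)
p own_text(doc.at("#a1")).length #=> 4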

Can I craft a single XPath 1.0 expression to find these nodes directly? Something like:

.//text()[ancestor::section = self] # self being the original context node

Dimitre Novatchev · Accepted Answer

For the section whose id attribute has the string value "a1", use:

   //section[@id='a1']
       //*[normalize-space(text()) and ancestor::section[1]/@id = 'a1']/text()

XSLT-based verification:

When this transformation is applied to the provided XML document:

It evaluates the XPath expression (selecting just the parents of the wanted text nodes, so that the results are clearly visible) and copies the selected nodes to the output:

UPDATE: In case several section elements share the same id value (or have no id attribute at all), use:

   (//section)[1]
       //*[normalize-space(text())
           and
           count(ancestor::section)
             = count((//section)[1]/ancestor::section) + 1]/text()

XSLT-based verification:

Transformation result (same):

This selects exactly the same wanted text nodes.
