Phrogz

Reputation: 303254

Find all descendant text() nodes except in subsections

My XML document has arbitrarily nested sections. Given a reference to a particular section, I need to find all the text nodes in that section, not including those inside its subsections.

For example, given a reference to the #a1 node below, I need to find only the "A1 " and "A1" text nodes:

<root>
  <section id="a1">
    <b>A1 <c>A1</c></b>
    <b>A1 <c>A1</c></b>
    <section id="a1.1">
      <b>A1.1 <c>A1.1</c></b>
    </section>
    <section id="a1.2">
      <b>A1.2 <c>A1.2</c></b>
      <section id="a1.2.1">
        <b>A1.2.1</b>
      </section>
      <b>A1.2 <c>A1.2</c></b>
    </section>
  </section>
  <section id="a2">
    <b>A2 <c>A2</c></b>
  </section>
</root>

In case it wasn't obvious, the above is made-up data. The id attributes in particular may not exist in the real-world document.

The best I've come up with for now is to find all text nodes within the section and then use Ruby to subtract out the ones I don't want:

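require 'nokogiri'

# All text nodes under the section, minus those inside any nested section.
def own_text(node)
  node.xpath('.//text()') - node.xpath('.//section//text()')
end

# mydoc holds the XML source; noblanks drops whitespace-only text nodes.
doc = Nokogiri.XML(mydoc, &:noblanks)
p own_text(doc.at("#a1")).length #=> 4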

Can I craft a single XPath 1.0 expression to find these nodes directly? Something like:

.//text()[ancestor::section = self] # self being the original context node

Upvotes: 2

Views: 333

Answers (2)

Dimitre Novatchev

Reputation: 243449

For the section whose id attribute has the string value "a1", use:

   //section[@id='a1']
       //*[normalize-space(text()) and ancestor::section[1]/@id = 'a1']/text()

XSLT-based verification:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
     <xsl:copy-of select=
      "//section[@id='a1']
           //*[normalize-space(text()) and ancestor::section[1]/@id = 'a1']
     "/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the provided XML document:

<root>
    <section id="a1">
        <b>A1 
            <c>A1</c>
        </b>
        <b>A1 
            <c>A1</c>
        </b>
        <section id="a1.1">
            <b>A1.1 
                <c>A1.1</c>
            </b>
        </section>
        <section id="a1.2">
            <b>A1.2 
                <c>A1.2</c>
            </b>
            <section id="a1.2.1">
                <b>A1.2.1</b>
            </section>
            <b>A1.2 
                <c>A1.2</c>
            </b>
        </section>
    </section>
    <section id="a2">
        <b>A2 
            <c>A2</c>
        </b>
    </section>
</root>

it evaluates the XPath expression (with the trailing /text() step dropped, so that the parent elements of the wanted text nodes are selected and the result is easier to read) and copies the selected nodes to the output:

<b>A1 
            <c>A1</c>
</b>
<c>A1</c>
<b>A1 
            <c>A1</c>
</b>
<c>A1</c>

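UPDATE: In case several section elements can share the same id value (or have no id attribute at all), use the following, where (//section)[1] stands in for whatever expression selects the section of interest: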

       (//section)[1]
           //*[normalize-space(text())
           and
              count(ancestor::section)
             =
               count((//section)[1]/ancestor::section) +1]/text()

XSLT-based verification:

<xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
     <xsl:output omit-xml-declaration="yes" indent="yes"/>
     <xsl:strip-space elements="*"/>

     <xsl:template match="/">
         <xsl:copy-of select=
          "(//section)[1]
               //*[normalize-space(text())
               and
                  count(ancestor::section)
                 =
                   count((//section)[1]/ancestor::section) +1]
         "/>
     </xsl:template>
</xsl:stylesheet>

Transformation result (same):

<b>A1 
            <c>A1</c>
</b>
<c>A1</c>
<b>A1 
            <c>A1</c>
</b>
<c>A1</c>

This selects exactly the same wanted text nodes.

Upvotes: 3

Kirill Polishchuk

Reputation: 56162

Use:

//text()[ancestor::section[1]/@id = 'a1']

Upvotes: 1
