My XML document has arbitrarily nested sections. Given a reference to a particular section, I need to find all the TextNodes in that section, not including subsections.
For example, given a reference to the #a1 node below, I need to find only the "A1 " and "A1" text nodes:
<root>
  <section id="a1">
    <b>A1 <c>A1</c></b>
    <b>A1 <c>A1</c></b>
    <section id="a1.1">
      <b>A1.1 <c>A1.1</c></b>
    </section>
    <section id="a1.2">
      <b>A1.2 <c>A1.2</c></b>
      <section id="a1.2.1">
        <b>A1.2.1</b>
      </section>
      <b>A1.2 <c>A1.2</c></b>
    </section>
  </section>
  <section id="a2">
    <b>A2 <c>A2</c></b>
  </section>
</root>
In case it wasn't obvious, the above is made-up data. The id attributes in particular may not exist in the real-world document.
The best I've come up with for now is to find all text nodes within the section and then use Ruby to subtract out the ones I don't want:
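require 'nokogiri'

# All text nodes under the node, minus those inside any nested section.
def own_text(node)
  node.xpath('.//text()') - node.xpath('.//section//text()')
end

doc = Nokogiri.XML(mydoc, &:noblanks) # mydoc holds the XML shown above
p own_text(doc.at("#a1")).length #=> 4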
Can I craft a single XPath 1.0 expression to find these nodes directly? Something like:
.//text()[ancestor::section = self] # self being the original context node
Reputation: 243449
Use (for the section whose id attribute has the string value "a1"):
//section[@id='a1']
//*[normalize-space(text()) and ancestor::section[1]/@id = 'a1']/text()
XSLT-based verification:
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="/">
    <xsl:copy-of select=
      "//section[@id='a1']
         //*[normalize-space(text()) and ancestor::section[1]/@id = 'a1']
      "/>
  </xsl:template>
</xsl:stylesheet>
When this transformation is applied to the provided XML document:
<root>
  <section id="a1">
    <b>A1
      <c>A1</c>
    </b>
    <b>A1
      <c>A1</c>
    </b>
    <section id="a1.1">
      <b>A1.1
        <c>A1.1</c>
      </b>
    </section>
    <section id="a1.2">
      <b>A1.2
        <c>A1.2</c>
      </b>
      <section id="a1.2.1">
        <b>A1.2.1</b>
      </section>
      <b>A1.2
        <c>A1.2</c>
      </b>
    </section>
  </section>
  <section id="a2">
    <b>A2
      <c>A2</c>
    </b>
  </section>
</root>
The transformation evaluates the XPath expression (selecting only the parents of the wanted text nodes, so that the results are clearly visible) and copies the selected nodes to the output:
<b>A1
<c>A1</c>
</b>
<c>A1</c>
<b>A1
<c>A1</c>
</b>
<c>A1</c>
UPDATE: In case the section elements can have the same id attributes (or no id attributes at all), use:
(//section)[1]
  //*[normalize-space(text())
      and count(ancestor::section) = count((//section)[1]/ancestor::section) + 1]/text()
XSLT-based verification:
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="/">
    <xsl:copy-of select=
      "(//section)[1]
         //*[normalize-space(text())
             and count(ancestor::section) = count((//section)[1]/ancestor::section) + 1]
      "/>
  </xsl:template>
</xsl:stylesheet>
Transformation result (same):
<b>A1
<c>A1</c>
</b>
<c>A1</c>
<b>A1
<c>A1</c>
</b>
<c>A1</c>
This selects exactly the same wanted text nodes.