Qualia

Reputation: 61

Exclude nodes between elements with an extra twist

I have a tricky XPath problem that I can't quite solve. Let's say I have the following:

<content>
  <body>
      <block id="123">
        <html>
          <p align="left">Some text</p>
        </html>
      </block>
      <block id="abc8383">
        <html>
          <p></p>
        </html>
      </block>
      <block id="456">
        <html>
          <p><span>Some more text</span></p>
        </html>
      </block>
      <block id="789">
        <html>
          <p></p>
        </html>
      </block>  
      <block id="012356">
        <html>
          <p class="finalBlock"><h3>content</h3><span>xyz</span></p>
        </html>
      </block>  
  </body>
</content>

I want to select all nodes above the element which has a p tag with a "finalBlock" class inside its html element, except for the ones that have no content (no node text, e.g. block id 789). However, this exclusion should only apply, moving upwards, until the first node with content is encountered; empty elements above that point should all be included. This means that the input above should produce the following output:

<content>
  <body>
      <block id="123">
        <html>
          <p align="left">Some text</p>
        </html>
      </block>
      <block id="abc8383">
        <html>
          <p></p>
        </html>
      </block>
      <block id="456">
        <html>
          <p><span>Some more text</span></p>
        </html>
      </block>  
      <block id="012356">
        <html>
          <p class="finalBlock"><h3>content</h3><span>xyz</span></p>
        </html>
      </block>  
  </body>
</content>

Here the element with an id of 789 was removed, but all others were kept. I've managed to craft an XPath query that excludes the empty elements, but I am struggling to implement the "between" rule. Any thoughts would be greatly appreciated!

Here's the expression that excludes the empty elements:

//block[html/p]/html/p[normalize-space(.) != '']
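For reference, the "between" rule described above can be sketched procedurally. This is only an illustration of the intended selection, not an XPath solution: it uses Python's standard `xml.etree.ElementTree`, and the helper names `has_text` and `blocks_to_keep` are hypothetical. It also assumes a block counts as "empty" when none of its p elements contains non-whitespace text.

```python
import xml.etree.ElementTree as ET

XML = """<content><body>
  <block id="123"><html><p align="left">Some text</p></html></block>
  <block id="abc8383"><html><p></p></html></block>
  <block id="456"><html><p><span>Some more text</span></p></html></block>
  <block id="789"><html><p></p></html></block>
  <block id="012356"><html><p class="finalBlock"><h3>content</h3><span>xyz</span></p></html></block>
</body></content>"""


def has_text(block):
    # Assumption: "content" means any <p> in the block has non-whitespace text.
    return any("".join(p.itertext()).strip() for p in block.iter("p"))


def blocks_to_keep(body):
    blocks = list(body)
    # Locate the block whose html/p carries class="finalBlock".
    final_idx = next(
        i for i, b in enumerate(blocks)
        if b.find("html/p[@class='finalBlock']") is not None
    )
    # Walk upwards from the finalBlock element, skipping empty blocks
    # until the first block with content is reached.
    idx = final_idx - 1
    while idx >= 0 and not has_text(blocks[idx]):
        idx -= 1
    # Keep everything up to that block, plus the finalBlock element itself.
    return blocks[: idx + 1] + [blocks[final_idx]]


body = ET.fromstring(XML).find("body")
kept_ids = [b.get("id") for b in blocks_to_keep(body)]
print(kept_ids)
```

With the sample input, block 789 is dropped while abc8383 is kept, because abc8383 sits above a block that has content.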

Upvotes: 2

Views: 105

Answers (1)

helderdarocha

Reputation: 23637

This expression selects "the element which has a p tag inside the html, with a finalBlock class", which is <block id="012356">:

//*[html/p[@class='finalBlock']]

This one selects all the block nodes that precede it ("all nodes above" - which does not include the ancestor nodes):

//*[html/p[@class='finalBlock']]/preceding-sibling::*

You can add a predicate to restrict that to only the ones that have a non-empty p descendant:

//*[html/p[@class='finalBlock']]/preceding-sibling::*[descendant::p[string()]]

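And the ones that have an empty p descendant, except the one closest to the finalBlock element (preceding-sibling is a reverse axis, so position() = 1 refers to the nearest preceding sibling among those that passed the first predicate):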

//*[html/p[@class='finalBlock']]/preceding-sibling::*[descendant::p[not(string())]][not(position() = 1)]

If you perform a union of the previous two expressions, you will obtain all the preceding block nodes that satisfy the requirements you stated (the finalBlock element itself is not part of the result):

//*[html/p[@class='finalBlock']]/preceding-sibling::*[descendant::p[string()]] 
| //*[html/p[@class='finalBlock']]/preceding-sibling::*[descendant::p[not(string())]][not(position() = 1)]
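To check the union against the sample document, here is a minimal sketch. It assumes the third-party `lxml` package is installed (the standard library's ElementTree does not support the preceding-sibling axis); the variable names are illustrative only.

```python
from lxml import etree

XML = """<content><body>
  <block id="123"><html><p align="left">Some text</p></html></block>
  <block id="abc8383"><html><p></p></html></block>
  <block id="456"><html><p><span>Some more text</span></p></html></block>
  <block id="789"><html><p></p></html></block>
  <block id="012356"><html><p class="finalBlock"><h3>content</h3><span>xyz</span></p></html></block>
</body></content>"""

root = etree.fromstring(XML)

# The block that contains the p with class="finalBlock".
FINAL = "//*[html/p[@class='finalBlock']]"

# Union of: preceding siblings with a non-empty p, and preceding siblings
# with an empty p except the one nearest to the finalBlock element.
expr = (
    FINAL + "/preceding-sibling::*[descendant::p[string()]]"
    " | "
    + FINAL + "/preceding-sibling::*[descendant::p[not(string())]][not(position() = 1)]"
)

# lxml returns the union as a node list in document order.
selected_ids = [block.get("id") for block in root.xpath(expr)]
print(selected_ids)
```

On this input the result contains blocks 123, abc8383 and 456. Note that the second predicate only skips a single empty block, so if several empty blocks sat directly above the finalBlock element, only the nearest one would be excluded.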

Upvotes: 1
