Vladimir Sirovitskiy

Reputation: 25

Select all nodes between two elements excluding unnecessary element from the intersection using XPath

There’s a document structured as follows:

<div class="document">

    <div class="title">
        <AAA/>
    </div>

    <div class="lead">
        <BBB/>
    </div>

    <div class="photo">
        <CCC/>
    </div>

    <div class="text">
        <!-- tags in text sections can vary. they can be `div` or `p` or anything. -->
        <DDD>
            <EEE/>
            <DDD/>
            <CCC/>
            <FFF/>
            <FFF>
                <GGG/>
            </FFF>
        </DDD>
    </div>

    <div class="more_text">
        <DDD>
            <EEE/>
            <DDD/>
            <CCC/>
            <FFF/>
            <FFF>
                <GGG/>
            </FFF>
        </DDD>
    </div>

    <div class="other_stuff">
        <DDD/>
    </div>

</div>

The task is to grab all the elements between <div class="lead"> and <div class="other_stuff"> except the <div class="photo"> element.

The Kayessian method for node-set intersection $ns1[count(.|$ns2) = count($ns2)] works perfectly. After substituting $ns1 with //*[@class="lead"]/following::* and $ns2 with //*[@class="other_stuff"]/preceding::*, the working code looks like this:

//*[@class="lead"]/following::*[count(. | //*[@class="other_stuff"]/preceding::*)
= count(//*[@class="other_stuff"]/preceding::*)]/text()
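To make the intersection test concrete, here is a minimal sketch that emulates it in plain Python. It is not an XPath engine: the helpers `following`, `preceding`, and `by_class` are hypothetical names written for this illustration, and the sample document is a trimmed-down version of the one above. The standard library's `xml.etree.ElementTree` does not support the `following::` and `preceding::` axes, so they are approximated here by walking the elements in document order.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the document in the question (an assumption).
XML = """
<div class="document">
  <div class="title"><AAA/></div>
  <div class="lead"><BBB/></div>
  <div class="photo"><CCC/></div>
  <div class="text"><DDD><EEE/></DDD></div>
  <div class="other_stuff"><DDD/></div>
</div>
"""

root = ET.fromstring(XML)
order = list(root.iter())                          # every element, in document order
parent = {child: p for p in order for child in p}  # child -> parent lookup


def ancestors(node):
    """Return the set of all ancestors of `node`."""
    found = set()
    while node in parent:
        node = parent[node]
        found.add(node)
    return found


def following(node):
    """Rough equivalent of following::* (later elements, descendants excluded)."""
    subtree = set(node.iter())                     # the node and its descendants
    return [e for e in order[order.index(node) + 1:] if e not in subtree]


def preceding(node):
    """Rough equivalent of preceding::* (earlier elements, ancestors excluded)."""
    skip = ancestors(node)
    return [e for e in order[:order.index(node)] if e not in skip]


def by_class(name):
    """First element whose class attribute equals `name`."""
    return next(e for e in order if e.get("class") == name)


ns1 = following(by_class("lead"))
ns2 = set(preceding(by_class("other_stuff")))

# Kayessian test: a node belongs to ns2 exactly when the union of the node
# and ns2 is no larger than ns2 itself, i.e. count(. | $ns2) = count($ns2).
between = [n for n in ns1 if len({n} | ns2) == len(ns2)]

print([(n.tag, n.get("class")) for n in between])
# [('div', 'photo'), ('CCC', None), ('div', 'text'), ('DDD', None), ('EEE', None)]
```

The XPath expression above applies the same membership test and then takes the `text()` children of each element that survives it.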

It selects everything between <div class="lead"> and <div class="other_stuff">, including the <div class="photo"> element. I tried several ways of adding a not() condition to the expression itself:

//*[@class="lead" and not(@class="photo ")]/following::*
//*[@class="lead"]/following::*[not(@class="photo ")]
//*[@class="lead"]/following::*[not(self::class="photo ")]

(and the same with the /preceding::* part), but none of them work. The not() condition seems to be ignored: the <div class="photo"> element remains in the selection.

Question 1: How to exclude the unnecessary element from this intersection?

Selecting from the <div class="photo"> element onward, which would exclude it automatically, is not an option, because in other documents it can appear in any position or not appear at all.

Question 2 (additional): Is it OK to use * after following:: and preceding:: in this case?

It initially selects everything up to the end and back to the beginning of the whole document. Would it be better to specify an exact end point for the following:: and preceding:: axes? I tried //*[@class="lead"]/following::[@class="other_stuff"], but it doesn’t seem to work.

Upvotes: 1

Views: 364

Answers (1)

har07

Reputation: 89285

Question 1: How to exclude the unnecessary element from this intersection?

Adding another predicate, [not(self::div[@class='photo'])] in this case, to your working XPath should do it. For this particular case, the entire XPath would look like this (formatted for readability):

//*[@class="lead"]
 /following::*[
    count(. | //*[@class="other_stuff"]/preceding::*) 
        = 
    count(//*[@class="other_stuff"]/preceding::*)
 ][not(self::div[@class='photo'])]
/text()
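As a rough illustration of what the extra predicate filters, the sketch below emulates the selection in plain Python on a trimmed-down copy of the document. The helper names (`ancestors_or_self`, `is_photo`) and the sample XML are assumptions made for this example, and the axes are approximated by document order rather than evaluated by an XPath engine. It also shows one thing worth checking in your own data: a `self::` test drops only the `div` itself, so elements nested inside it stay in the set unless the test is widened to `ancestor-or-self::`.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the document in the question (an assumption).
XML = """
<div class="document">
  <div class="title"><AAA/></div>
  <div class="lead"><BBB/></div>
  <div class="photo"><CCC/></div>
  <div class="text"><DDD><EEE/></DDD></div>
  <div class="other_stuff"><DDD/></div>
</div>
"""

root = ET.fromstring(XML)
order = list(root.iter())                          # every element, in document order
parent = {child: p for p in order for child in p}  # child -> parent lookup


def ancestors_or_self(node):
    """Return `node` followed by its ancestors, nearest first."""
    chain = [node]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def is_photo(e):
    """Mirrors the node test self::div[@class='photo']."""
    return e.tag == "div" and e.get("class") == "photo"


lead = next(e for e in order if e.get("class") == "lead")
stop = next(e for e in order if e.get("class") == "other_stuff")

# Elements after "lead" (its own subtree skipped) and before "other_stuff"
# (its ancestors skipped), i.e. the intersection of the two axes.
lead_subtree = set(lead.iter())
stop_chain = ancestors_or_self(stop)
between = [
    e for e in order[order.index(lead) + 1:order.index(stop)]
    if e not in lead_subtree and e not in stop_chain
]

# Like [not(self::div[@class='photo'])]: removes the div, keeps its children.
self_filtered = [e for e in between if not is_photo(e)]

# Like [not(ancestor-or-self::div[@class='photo'])]: removes the whole subtree.
subtree_filtered = [
    e for e in between
    if not any(is_photo(a) for a in ancestors_or_self(e))
]

print([e.tag for e in self_filtered])     # ['CCC', 'div', 'DDD', 'EEE']
print([e.tag for e in subtree_filtered])  # ['div', 'DDD', 'EEE']
```

Whether the second, wider filter is needed depends on whether the contents of the photo block should also be left out of the result.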

Question 2 (additional): Is it OK to use * after following:: and preceding:: in this case?

I'm not sure whether it would be 'better'. What I can tell you is that following::[@class="other_stuff"] is an invalid expression: an axis must be followed by a node test that names the nodes the predicate applies to, for example 'any element', following::*[@class="other_stuff"], or just 'div', following::div[@class="other_stuff"].

Upvotes: 1
