valerij vasilcenko

Reputation: 258

XPath: exclude an element and all its children by parent attribute containing a value

Example markup:

<div class="post-content">
    <p>
        <moredepth>
            <...>
                <span class="image-container float_right">
                    <div class="some_element">
                        image1
                    </div>
                    <p>do not need this</p>
                </span>
                <div class="image-container float_right">
                    image2
                </div>
                <p>text1</p>
                <li>text2</li>
            </...>
        </moredepth>
    </p>
</div>

The worst part is that "image-container" can appear at any depth.

The XPath I tried:

//div[contains(@class, 'post-content')]//*[not(contains(@class, 'image-container'))]

What XPath should I use to exclude every "image-container" element itself, together with "some_element" and any other descendants of it at any depth?

The output for this example should be:

<p>
    <moredepth>
        <...>

            <p>text1</p>
            <li>text2</li>
        </...>
    </moredepth>
</p>

P.S. Is it possible to make such a selection using CSS?

Upvotes: 6

Views: 4255

Answers (2)

Mathias Müller

Reputation: 22617

XPath only selects nodes; it cannot modify a fragment of XML once a path expression has returned it. So you cannot select moredepth:

//moredepth

without getting the whole element node back, including all the descendant nodes you would like to exclude:

<moredepth>
<span class="image-container float_right">
<div class="some_element">
image1
</div>
<p>do not need this</p>
</span>
<div class="image-container float_right">
image2
</div>
<p>text1</p>
<li>text2</li>
</moredepth>

What you can do is select only those child nodes of moredepth whose class does not contain image-container:

//div[contains(@class, 'post-content')]/p/moredepth/*[not(contains(@class,'image-container'))]

which will yield (individual results separated by -------):

<p>text1</p>
-----------------------
<li>text2</li>
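
If the nested output from the question is what you need, the pruning has to happen in the host language after parsing, since XPath itself cannot do it. Below is a minimal sketch using Python's standard-library ElementTree. The `<wrapper>` element is a hypothetical stand-in for the elided `<...>` level so that the markup parses, and `prune` is an illustrative helper name, not part of any library.

```python
import xml.etree.ElementTree as ET

# The question's markup, with the elided <...> level replaced by a
# hypothetical <wrapper> element so that the snippet parses as XML.
MARKUP = """
<div class="post-content">
    <p>
        <moredepth>
            <wrapper>
                <span class="image-container float_right">
                    <div class="some_element">image1</div>
                    <p>do not need this</p>
                </span>
                <div class="image-container float_right">image2</div>
                <p>text1</p>
                <li>text2</li>
            </wrapper>
        </moredepth>
    </p>
</div>
"""


def prune(element, unwanted="image-container"):
    """Remove, in place, every descendant whose class attribute contains `unwanted`."""
    for child in list(element):  # iterate over a copy because we mutate the parent
        if unwanted in child.get("class", ""):
            element.remove(child)  # drops the child together with its whole subtree
        else:
            prune(child, unwanted)  # keep the child, but clean up below it


root = ET.fromstring(MARKUP)
prune(root)

remaining = [el.tag for el in root.iter()]
print(remaining)  # ['div', 'p', 'moredepth', 'wrapper', 'p', 'li']
print(ET.tostring(root, encoding="unicode"))
```

The tree is modified in place, so parse a fresh copy if the original is still needed.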

Upvotes: 3

helderdarocha

Reputation: 23637

You can apply the Kaysian method for obtaining the intersection of two node-sets. You have two sets:

A: The elements that descend from //div[contains(@class, 'post-content')], excluding the div itself (since you don't want the root div):

//*[ancestor::div[contains(@class, 'post-content')]]

B: The elements that neither are an image-container nor have one as an ancestor. The ancestor-or-self axis is used because you want to exclude the entire subtree, including the div and span themselves:

//*[not(ancestor-or-self::*[contains(@class, 'image-container')])] 

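The intersection of those two sets is the solution to your problem. The formula of the Kaysian method is: A [ count(. | B) = count(B) ]. The union of the context node with B has the same size as B only when that node is already a member of B, so the predicate keeps exactly the members of A that are also in B. Applying that to your problem, the result you need is: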

//*[ancestor::div[contains(@class, 'post-content')]]
   [ count(. | //*[not(ancestor-or-self::*[contains(@class, 'image-container')])])
     = 
     count(//*[not(ancestor-or-self::*[contains(@class, 'image-container')])]) ]

This will select the following elements from your example code:

/div/p
/div/p/moredepth
/div/p/moredepth/...
/div/p/moredepth/.../p
/div/p/moredepth/.../li

excluding the span and the div that match the unwanted class, along with their descendants.

You can then add extra steps to the expression to filter out exactly which text or nodes you need.

Upvotes: 5
