toni rmc

Reputation: 878

XPath - Select Elements Not Containing Elements

I can't seem to find a topic that answers this, so I'm asking it myself.
Since this is a generic question whose answer applies to most documents, I don't think a specific code example is necessary.

Using XPath, I want to select all table nodes that do not nest other tables.
That means no descendant table elements at any depth. I also want to discard all tables whose text content consists only of whitespace.

I have tried this:

//table[not(child::table) and normalize-space(.)]

but it's not working.

What is the right way to do it?

Upvotes: 2

Views: 5195

Answers (2)

StuartLC

Reputation: 107237

Assuming that you are scraping (X)HTML, and noting that a table cannot have another table as a direct child, it is likely that you are looking for descendant table elements rather than direct child elements.

//table[not(descendant::table)]

Consider the XML below:

<xml>
    <table id="hasDescendent">
        <tr>
            <td>
                <table id="Inner Descendent"/>
            </td>
        </tr>
    </table>
    <table id="directChild">
        <table id="Inner Direct Child" />
    </table>
    <table id="nochild">
    </table>
</xml>

The XPath expression //table[not(descendant::table)] returns the following tables:

  • Inner Descendent
  • Inner Direct Child
  • nochild
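
The same selection can be sketched in Python for readers who want to check the result themselves. This is only an illustration under stated assumptions: it uses the standard library's xml.etree.ElementTree, whose limited XPath subset has no not() function or descendant:: axis, so the predicate is re-implemented by hand. The helper name leaf_tables is hypothetical and is not part of any library.

```python
import xml.etree.ElementTree as ET

# The sample document from the answer above.
XML = """<xml>
    <table id="hasDescendent">
        <tr>
            <td>
                <table id="Inner Descendent"/>
            </td>
        </tr>
    </table>
    <table id="directChild">
        <table id="Inner Direct Child" />
    </table>
    <table id="nochild">
    </table>
</xml>"""


def leaf_tables(root):
    """Hypothetical helper mirroring //table[not(descendant::table)]."""
    result = []
    for table in root.iter("table"):
        # Element.iter() also yields the element itself, so exclude it
        # before deciding whether any nested table exists.
        nested = [t for t in table.iter("table") if t is not table]
        if not nested:
            result.append(table)
    return result


root = ET.fromstring(XML)
ids = [t.get("id") for t in leaf_tables(root)]
print(ids)
```

The printed ids should match the three tables listed above, in document order. A library with full XPath 1.0 support could evaluate the expression directly instead of walking the tree by hand.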

Upvotes: 3

alecxe

Reputation: 473763

Let's use the following HTML fragment as an example:

<div>
    <table id="1">

    </table>

    <table id="2">
        <table>
            <tr>
                <td>2</td>
            </tr>
        </table>
    </table>

    <table id="3">
        <div>I'm the one you wanted to find</div>
    </table>
</div>

According to your description, the first table should be discarded because it contains only whitespace, and the second table should be discarded because there is another table inside it.

The following XPath expression would match the third table only:

/div/table[(not(child::table) and normalize-space(.))] 

Demo (using xmllint tool):

$ xmllint index.html --xpath '/div/table[(not(child::table) and normalize-space(.))]'
<table id="3">
    <div>I'm the one you wanted to find</div>
</table>
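
For comparison, here is a rough Python sketch of the two conditions from the question (no nested table at any depth, and non-whitespace text content) applied to the fragment above. It is an assumption-laden illustration, not a replacement for the XPath: it relies on the standard library's xml.etree.ElementTree, the helper names has_nested_table and has_text are hypothetical, and has_text only approximates normalize-space(.) being non-empty.

```python
import xml.etree.ElementTree as ET

# The HTML fragment from the answer above, parsed here as XML.
HTML = """<div>
    <table id="1">

    </table>

    <table id="2">
        <table>
            <tr>
                <td>2</td>
            </tr>
        </table>
    </table>

    <table id="3">
        <div>I'm the one you wanted to find</div>
    </table>
</div>"""


def has_nested_table(table):
    """True if any table other than `table` itself appears below it."""
    return any(t is not table for t in table.iter("table"))


def has_text(table):
    """Approximates a non-empty normalize-space(.): some non-whitespace text."""
    return bool("".join(table.itertext()).split())


root = ET.fromstring(HTML)

# Only tables that are direct children of the root <div>, like /div/table[...].
top_level = [t.get("id") for t in root.findall("table")
             if not has_nested_table(t) and has_text(t)]

# Every table in the document, like //table[...]; the inner table has no id.
anywhere = [t.get("id") for t in root.iter("table")
            if not has_nested_table(t) and has_text(t)]

print(top_level)
print(anywhere)
```

Note the difference between the two lists: restricting the search to /div/table keeps only the third table, while searching all tables also picks up the unnamed inner table of the second one, since it contains text and nests no further table.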

Upvotes: 1
