Mads Skjern

Reputation: 5890

XPath to look for subtree

I'm scraping an HTML document whose structure changes all the time. Even the CSS class names change, so I can't rely on them. However, one thing never changes: the value is always contained in a subtree exactly like the following:

<span>
  <span>
    <span>wanted value</span>
    <span></span>
  </span>
</span>

Can this be expressed as an XPath expression?

It should not match:

<span>
  <span>
    <span> 1, one too little </span>
    <span> 2 </span>
    <span> 3, one too many </span>
    <span> 4, two too many </span>
  </span>
</span>

I plan to do this using lxml for Python.

Upvotes: 2

Views: 863

Answers (1)

Mark Veenstra

Reputation: 4739

If the wanted value is always in the first span at the third level of nested span elements, the following XPath will work:

//span/span/span[1]

When applied to the following HTML document:

<html>
  <head>
    <title>Your Title</title>
  </head>
  <body>
    <div>
    <span>
      <span>
        <span>wanted value</span>
        <span></span>
      </span>
    </span>
    </div>
    <div>
    <span>
      <span>
        <span>wanted value</span>
        <span></span>
      </span>
    </span>
    </div>
  </body>
</html>

The result will be:

wanted value
wanted value
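
Since the question mentions lxml, here is a minimal sketch of how that expression might be run from Python. The `DOC` string is a condensed copy of the sample document above, and the variable names are only illustrative.

```python
from lxml import html

# Condensed version of the sample document (same nesting, less whitespace).
DOC = """
<html>
  <head><title>Your Title</title></head>
  <body>
    <div><span><span><span>wanted value</span><span></span></span></span></div>
    <div><span><span><span>wanted value</span><span></span></span></span></div>
  </body>
</html>
"""

tree = html.fromstring(DOC)

# Select the first span child of every span that is itself nested in a span.
matches = tree.xpath("//span/span/span[1]")
values = [el.text for el in matches]

for value in values:
    print(value)
```

Note that `xpath()` returns the matching elements, so the text is read from each element's `.text` attribute.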

EDIT

If you only want the first span on the third level when its parent contains exactly two span children, you can use the following XPath:

//span/span[count(span) = 2]/span[1]
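
As a rough check with lxml, the sketch below runs this expression against a matching subtree and against subtrees with one and four inner spans, modelled on the examples in the question. The helper name `wanted_values` and the wrapping `<div>` are assumptions made for the illustration.

```python
from lxml import html

# Hypothetical helper: wraps a fragment in a <div> and applies the XPath.
XPATH = "//span/span[count(span) = 2]/span[1]"


def wanted_values(markup):
    tree = html.fromstring("<div>" + markup + "</div>")
    return [el.text for el in tree.xpath(XPATH)]


# Exactly two spans on the third level: should match.
two_spans = "<span><span><span>wanted value</span><span></span></span></span>"

# One span and four spans on the third level: should not match.
one_span = "<span><span><span>1</span></span></span>"
four_spans = (
    "<span><span>"
    "<span>1</span><span>2</span><span>3</span><span>4</span>"
    "</span></span>"
)

print(wanted_values(two_spans))
print(wanted_values(one_span))
print(wanted_values(four_spans))
```

Keep in mind that `count(span)` only counts span children, so a parent with two spans plus some other child element would still match.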

Upvotes: 3
