Reputation: 5890
I'm scraping an HTML document whose structure changes all the time. Even the CSS class names change, so I can't rely on them. One thing never changes, however: the value is always contained in a subtree exactly like the following:
<span>
<span>
<span>wanted value</span>
<span></span>wanted value
</span>
</span>
Can this be expressed as an XPath expression?
It should not match:
<span>
<span>
<span> 1, one too little </span>
<span> 2 </span>
<span> 3, one too many </span>
<span> 4, two too many </span>
</span>
</span>
I plan to do this using lxml for Python.
Upvotes: 2
Views: 863
Reputation: 4739
If the wanted value is always in the first span on the third level of nested spans, the following XPath will work:
//span/span/span[1]
When applied to the following HTML document:
<html>
<head>
<title>Your Title</title>
</head>
<body>
<div>
<span>
<span>
<span>wanted value</span>
<span></span>
</span>
</span>
</div>
<div>
<span>
<span>
<span>wanted value</span>
<span></span>
</span>
</span>
</div>
</body>
</html>
The result will be:
wanted value
wanted value
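Since the question mentions lxml, here is a minimal sketch of how that expression could be evaluated from Python. The inline markup is a condensed copy of the document above, and the variable names (`doc`, `values`) are only illustrative.

```python
from lxml import html

# Condensed version of the sample document: two divs, each holding
# the span > span > span subtree with the wanted value.
doc = html.fromstring("""
<html><body>
<div><span><span><span>wanted value</span><span></span></span></span></div>
<div><span><span><span>wanted value</span><span></span></span></span></div>
</body></html>
""")

# Select the first span on the third level of nested spans.
values = [el.text_content().strip() for el in doc.xpath("//span/span/span[1]")]
print(values)
```

Note that this expression looks only at position, so it would also select the first span of a subtree that has more than two spans on the third level.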
EDIT
If you only want the first span on the third level when its parent contains exactly two spans, use the following XPath:
//span/span[count(span) = 2]/span[1]
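As a rough check with lxml, the sketch below runs this expression against a hypothetical page that holds both the wanted subtree and the four-span subtree from the question that should not match. The page content and the names `PAGE`, `tree`, and `texts` are assumptions made for illustration.

```python
from lxml import html

# Hypothetical test page: the first div holds the wanted subtree, the second
# holds the four-span subtree from the question that must not match.
PAGE = """
<html><body>
<div><span><span><span>wanted value</span><span></span></span></span></div>
<div><span><span>
  <span> 1, one too little </span>
  <span> 2 </span>
  <span> 3, one too many </span>
  <span> 4, two too many </span>
</span></span></div>
</body></html>
"""

tree = html.fromstring(PAGE)

# Keep only second-level spans with exactly two span children,
# then take the first of those children.
matches = tree.xpath("//span/span[count(span) = 2]/span[1]")
texts = [m.text_content().strip() for m in matches]
print(texts)
```

The `count(span) = 2` predicate counts only direct span children, so a parent that also has other child elements alongside its two spans would still match.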
Upvotes: 3