Reputation: 8980
I was introduced to XPath today and it seems very powerful, but after quite a bit of searching I haven't found how to retrieve siblings (via following-sibling and preceding-sibling) when contains is used:
text = """
<html>
<head>
<title>This tag includes 'some_text'</title>
<h2>A h2 tag</h2>
</head>
</html>
"""
import lxml.html
doc = lxml.html.fromstring(text)
a = doc.xpath("//*[contains(text(),'some_text')]/following-sibling::*")
which produces []. Of course, the result I expect is the h2 tag.
However, using //*[contains(text(),'some_text')] on its own retrieves the title element, as expected. Likewise, if I use //parent::* instead of the following-sibling axis (I think that's what it's called), that also works.
So, how can I get the siblings under that condition?
Thanks in advance.
Upvotes: 3
Views: 10750
Reputation: 163625
The key thing here is that your XPath is looking at a tree created by an HTML parser, not an XML parser. HTML parsers create nodes in the tree that are not explicit in your source: in effect, they repair invalid HTML and turn it into valid HTML. This affects any attempt to navigate an HTML tree, whether you use XPath, jQuery, or direct DOM APIs.
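One way to see the repair is to serialize the tree that lxml actually built from the question's markup and check each element's parent. The following is only a sketch, reusing the question's `text`. The variable names and the use of the `following` axis as a workaround are my own assumptions, not something the asker wrote. An h2 is not valid inside head, so the parser is expected to move it into an implied body, which is why it stops being a sibling of title.

```python
import lxml.etree
import lxml.html

# Same markup as in the question: an <h2> placed (invalidly) inside <head>.
text = """
<html>
<head>
<title>This tag includes 'some_text'</title>
<h2>A h2 tag</h2>
</head>
</html>
"""

doc = lxml.html.fromstring(text)

# Serialize the parsed tree to see what the parser really produced.
print(lxml.etree.tostring(doc, pretty_print=True).decode())

# Check where each element ended up after the repair.
title = doc.xpath("//title")[0]
h2 = doc.xpath("//h2")[0]
title_parent = title.getparent().tag
h2_parent = h2.getparent().tag
print(title_parent, h2_parent)

# The sibling axis only looks at nodes that share the same parent.
siblings = doc.xpath("//*[contains(text(),'some_text')]/following-sibling::*")

# The following axis walks document order and ignores parent boundaries.
following_h2 = doc.xpath("//*[contains(text(),'some_text')]/following::h2")
print(siblings, [e.tag for e in following_h2])
```

If the input really is HTML from the wild, the `following` axis (or fixing the markup) is probably the more robust choice, since you cannot rely on the source nesting surviving the parse.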
Upvotes: 0
Reputation: 1825
Funny html sample you have.
# A bare "import lxml" does not load the submodules used below.
import lxml.etree
import lxml.html
text = """
<html>
<body>
<span>This tag includes 'some_text'</span>
<h2>A h2 tag</h2>
</body>
</html>
"""
doc = lxml.etree.fromstring(text, parser=lxml.etree.HTMLParser())
doc.xpath("//*[contains(text(),'some_text')]/following-sibling::*")
# [<Element h2 at 102eee100>]
doc = lxml.html.fromstring(text)
doc.xpath("//*[contains(text(),'some_text')]/following-sibling::*")
# [<Element h2 at 102f6f188>]
UPDATE:
Here I don't use the html parser with its validation rules, and treat the input as plain XML:
text = """
<html>
<head>
<title>This tag includes 'some_text'</title>
<h2>A h2 tag</h2>
</head>
</html>
"""
doc = lxml.etree.fromstring(text)
doc.xpath("//*[contains(text(),'some_text')]/following-sibling::*[1]")
# [<Element h2 at 102eeef70>]
Upvotes: 7
Reputation: 3729
<?xml version="1.0" ?>
<html>
<head>
<title>This tag includes 'some_text'</title>
<h2>A h2 tag</h2>
</head>
</html>
//*[contains(text(),'some_text')]/following-sibling::*
Array
(
[0] => SimpleXMLElement Object
(
[0] => A h2 tag
)
)
I used PHP SimpleXMLElement, but the xpath should be the same.
Upvotes: 1
Reputation:
There are a few things that need to be clarified before answering this:
Testing this in an XML editor shows your XPath is valid, but I got no elements when testing in lxml, which may mean that it is changing the XML somehow (but I didn't check).
I would recommend reconsidering whether XPath is the right tool for this job, especially if you are trying to use it for scraping web pages or similar.
You might also think about rewriting the XPath statement so it is a little more readable as well.
//*[contains(text(),'some_text')]/following-sibling::*
This says: Find me any element that has "some_text" in its text, then get all of its following siblings.
//*[preceding-sibling::*[position()=1 and contains(text(),'some_text')]]
Whereas this says: Find me the element whose immediately preceding sibling has text that contains "some_text".
This may be a style issue, but I find the latter more readable.
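Note that the two expressions are not strictly equivalent: the first returns every following sibling, the second only the one right after the match. Here is a small sketch to illustrate, parsing the input as XML so nothing gets rearranged. The extra h3 element is hypothetical, added only to make the difference visible, and the variable names are mine.

```python
import lxml.etree

# The question's markup plus a hypothetical <h3> to show the difference.
text = """
<html>
<head>
<title>This tag includes 'some_text'</title>
<h2>A h2 tag</h2>
<h3>A h3 tag</h3>
</head>
</html>
"""

# Parse as XML, so the tree mirrors the source nesting exactly.
doc = lxml.etree.fromstring(text)

# Every sibling that comes after the matching element.
all_following = doc.xpath(
    "//*[contains(text(),'some_text')]/following-sibling::*"
)

# Only the element whose nearest preceding sibling is the match.
next_only = doc.xpath(
    "//*[preceding-sibling::*[position()=1 and contains(text(),'some_text')]]"
)

print([e.tag for e in all_following])  # ['h2', 'h3']
print([e.tag for e in next_only])      # ['h2']
```

Adding [1] to the first expression, as in following-sibling::*[1], should give the same single element as the second form.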
Upvotes: 1