r_31415

Reputation: 8980

How to get siblings when using contains(text(), ) in xpath

I have been introduced to xpath today and it seems to be very powerful but after quite a bit of searching, I haven't found how to retrieve siblings (via following-sibling and preceding-sibling) when contains is being used:

text = """
<html>
  <head>
    <title>This tag includes 'some_text'</title>
    <h2>A h2 tag</h2>
  </head>
</html>
"""

import lxml.html
doc = lxml.html.fromstring(text)
a = doc.xpath("//*[contains(text(),'some_text')]/following-sibling::*")

which produces []. Of course, the result I expect is to get the h2 tag.

However, using //*[contains(text(),'some_text')] on its own retrieves the title element, as expected. Likewise, if I use //parent::* instead of the following-sibling axis (I think that's what it's called), that works too.

So, how can I get the siblings under that condition?

Thanks in advance.

Upvotes: 3

Views: 10750

Answers (4)

Michael Kay

Reputation: 163625

The key thing here is that your XPath is looking at a tree created by an HTML5 parser, not an XML parser. HTML5 parsers create nodes in the tree that are not explicit in your source: in effect, they repair invalid HTML and turn it into valid HTML. This affects any attempt to navigate an HTML tree, whether you use XPath, JQuery, or direct DOM APIs.
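One way to check this yourself is to serialise the tree that the parser actually built and compare the parents of the two elements. This is only a sketch using the markup from the question; the variable names are mine, and the exact repaired output may vary with the lxml/libxml2 version, so print it rather than assume it.

```python
import lxml.html

text = """
<html>
  <head>
    <title>This tag includes 'some_text'</title>
    <h2>A h2 tag</h2>
  </head>
</html>
"""

doc = lxml.html.fromstring(text)

# Serialise the repaired tree to see where each element really ended up.
print(lxml.html.tostring(doc, pretty_print=True).decode())

title = doc.xpath("//title")[0]
h2 = doc.xpath("//h2")[0]

# following-sibling only reaches nodes that share a parent, so compare parents.
print(title.getparent().tag, h2.getparent().tag)
same_parent = title.getparent() is h2.getparent()
print(same_parent)
```

If the two parents differ, following-sibling from the title can never reach the h2, which would explain the empty list in the question.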

Upvotes: 0

Misha Akovantsev

Reputation: 1825

Funny html sample you have.

# "import lxml" alone does not load the submodules used below
import lxml.etree
import lxml.html

text = """
<html>
  <body>
    <span>This tag includes 'some_text'</span>
    <h2>A h2 tag</h2>
  </body>
</html>
"""

doc = lxml.etree.fromstring(text, parser=lxml.etree.HTMLParser())
doc.xpath("//*[contains(text(),'some_text')]/following-sibling::*")
# [<Element h2 at 102eee100>]

doc = lxml.html.fromstring(text)
doc.xpath("//*[contains(text(),'some_text')]/following-sibling::*")
# [<Element h2 at 102f6f188>]

UPDATE:

Here I don't use the HTML parser with its repair rules, and instead treat the input as plain XML:

text = """
<html>
  <head>
    <title>This tag includes 'some_text'</title>
    <h2>A h2 tag</h2>
  </head>
</html>
"""
doc = lxml.etree.fromstring(text)
doc.xpath("//*[contains(text(),'some_text')]/following-sibling::*[1]")
# [<Element h2 at 102eeef70>]

Upvotes: 7

TecBrat

Reputation: 3729

<?xml version="1.0" ?>
  <html>
    <head>
      <title>This tag includes 'some_text'</title>
      <h2>A h2 tag</h2>
    </head>
  </html>
//*[contains(text(),'some_text')]/following-sibling::*
Array
(
    [0] => SimpleXMLElement Object
        (
            [0] => A h2 tag
        )

)

I used PHP SimpleXMLElement, but the xpath should be the same.

Upvotes: 1

user764357

Reputation:

There are a few things that need to be clarified before answering this:

  1. following-sibling will return ALL following siblings, not just the immediate one. So if there are nodes after the <h2>, they will also be returned.
  2. HTML is not XML. While lxml will try to clean the source up for you, if you can't trust that the incoming HTML is clean, your XPaths may fail. E.g. I believe title tags don't need closing tags in HTML, so depending on how broken the source is, lxml may incorrectly put the <h2> as a child of the <title>, which may break the XPath.
  3. Titles can't have child elements, which may influence how lxml cleans it up (such as adding a body tag between them, etc.).
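To illustrate the first point, here is a minimal sketch on a made-up XML fragment (the element names r, a, b and c are hypothetical). It is parsed with lxml.etree as XML, so no HTML repair gets in the way.

```python
from lxml import etree

# Hypothetical document, parsed as XML so no HTML repair is applied.
root = etree.fromstring("<r><a>some_text</a><b/><c/></r>")

match = "//*[contains(text(),'some_text')]"

# Without a positional predicate, every later sibling is returned.
all_following = [e.tag for e in root.xpath(match + "/following-sibling::*")]

# [1] keeps only the nearest following sibling.
first_following = [e.tag for e in root.xpath(match + "/following-sibling::*[1]")]

print(all_following)    # ['b', 'c']
print(first_following)  # ['b']
```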

Testing this in an XML editor shows your XPath is valid, but I was also getting no elements when testing in lxml, which may mean that it is changing the XML somehow (but I didn't check).

I would recommend rethinking whether XPath is the tool for this job, especially if you are trying to use it for scraping web pages or similar.

You might also think about rewriting the XPath statement so it is a little more readable as well.

//*[contains(text(),'some_text')]/following-sibling::*

This says: find me any element that has "some_text" in its text, then get all of its following siblings.

//*[preceding-sibling::*[position()=1 and contains(text(),'some_text')]]

Whereas this says: find me any element whose immediately preceding sibling has text that contains "some_text".

This may be a style issue, but I find the latter more readable.
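Note that the two forms are not interchangeable when more than one sibling follows the match. A small sketch on a hypothetical XML fragment (the element names are made up), with the second predicate written as [position()=1 and contains(...)]:

```python
from lxml import etree

# Hypothetical fragment with two elements after the matching one.
root = etree.fromstring("<r><a>some_text</a><b/><c/></r>")

forward = "//*[contains(text(),'some_text')]/following-sibling::*"
backward = "//*[preceding-sibling::*[position()=1 and contains(text(),'some_text')]]"

forward_tags = [e.tag for e in root.xpath(forward)]
backward_tags = [e.tag for e in root.xpath(backward)]

print(forward_tags)   # ['b', 'c']
print(backward_tags)  # ['b']
```

On a reverse axis such as preceding-sibling, position()=1 means the nearest sibling, so the second form only selects the element directly after the match. It behaves like following-sibling::*[1] rather than like the unrestricted following-sibling::*.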

Upvotes: 1
