MITHU

Reputation: 154

Following sibling within an xpath is not working as intended

I've been trying to extract a piece of text from some HTML elements using XPath, but I seem to be going wrong somewhere and can't make it work.

HTML elements:

htmlelem = """
<div class="content">
    <p>Type of cuisine: </p>International
</div>
"""

I would like to extract International using XPath. I know I could get it with .next_sibling if I used a CSS selector, but I'm not interested in going that route.

That said, if I try the following, I can get the text using XPath:

tree.xpath("//*[@class='content']/p/following::text()")[0]

But the above expression is not what I'm after, because I can't use it within Selenium WebDriver if I stick to driver.find_element_by_xpath().

The only approach I'm interested in is like the following, but it is not working:

"//*[@class='content']/p/following::*"

Real-life example:

from lxml.html import fromstring

htmlelem = """
<div class="content">
    <p>Type of cuisine: </p>International
</div>
"""
tree = fromstring(htmlelem)
item = tree.xpath("//*[@class='content']/p/following::text()")[0].strip()
elem = tree.xpath("//*[@class='content']/p/following::*")[0].text
print(elem)

In the above example, printing item succeeds but printing elem fails. I would like to fix the expression used for elem.

How can I make it work so that I can use the same XPath in both lxml and Selenium?

Upvotes: 2

Views: 283

Answers (1)

Jack Fleeting

Reputation: 24930

Since the OP is looking for a solution that extracts the text outside the XPath expression itself, the following should do that, albeit somewhat awkwardly:

tree.xpath("//*[@class='content']")[0][0].tail

Output:

International

The need for this approach is a result of the way lxml parses the HTML: tree.xpath("//*[@class='content']") returns a list of length 1. The first (and only) element in that list, tree.xpath("//*[@class='content']")[0], is an lxml.html.HtmlElement, which can itself be treated as a list of its children and also has length 1.

The desired output is stored in the tail of the first (and only) child of that lxml.html.HtmlElement, i.e. the <p> element.
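
As a rough sketch of the above, reusing the htmlelem snippet from the question: the variable names (container, first_child and so on) are only illustrative. It also shows why following::* comes back empty here (the * node test matches element nodes only, and International is a text node) and, as an alternative not covered above, a following-sibling::text() expression that works in lxml. That last expression still selects a text node, so it would not be expected to work with Selenium's element lookup.

```python
from lxml.html import fromstring

htmlelem = """
<div class="content">
    <p>Type of cuisine: </p>International
</div>
"""
tree = fromstring(htmlelem)

# The XPath returns a one-item list; that item is the <div>.
container = tree.xpath("//*[@class='content']")[0]

# An element behaves like a list of its children; the only child is the <p>.
first_child = container[0]

# Text that appears after </p> is stored on the <p> element as .tail.
tail_text = first_child.tail.strip()

# following::* matches element nodes only, so nothing is selected here.
following_elements = tree.xpath("//*[@class='content']/p/following::*")

# Alternative (lxml only): select the first text node that is a sibling after <p>.
sibling_text = tree.xpath(
    "//*[@class='content']/p/following-sibling::text()[1]"
)[0].strip()

print(tail_text)
print(len(following_elements))
print(sibling_text)
```

The empty result from following::* is also why indexing it with [0] in the question fails: there is no element after the <p> to select.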

Upvotes: 2
