showkey

Reputation: 338

How to write the xpath expression?

text = '''\
<html>
    <body>
        <p><strong>test</strong>TEXT A B </p>
        <p><strong>test</strong>TEXT A </p>
        <p><strong>test</strong>TEXT B </p>
        <p><strong>ok</strong>TEXT A B </p>
        <p>TEXT A B </p>
    </body>
</html>'''
import lxml.html
root = lxml.html.fromstring(text)

The HTML contains five p nodes; I want to extract only <p><strong>test</strong>TEXT A B </p>.

The node I want has these features:

1. The text value of the p element contains both A and B.
2. The text value of p's child element strong is test.

node = root.xpath('.//p[contains(text(),"A") and contains(text(),"B")]')

The expression above extracts three nodes, because it does not check strong. I then tried this XPath:

node = root.xpath('.//p[/strong(contains(text(),"test")) and contains(text(),"A") and contains(text(),"B")]')

This is an invalid XPath expression. How do I write it correctly?

Upvotes: 1

Views: 114

Answers (2)

dabingsou

Reputation: 2469

Try a solution other than XPath, and you may like it, too.

from simplified_scrapy import SimplifiedDoc
html = '''<html>
    <body>
        <p><strong>test</strong>TEXT A B </p>
        <p><strong>test</strong>TEXT A </p>
        <p><strong>test</strong>TEXT B </p>
        <p><strong>ok</strong>TEXT A B </p>
        <p>TEXT A B </p>
    </body>
</html>'''
doc = SimplifiedDoc(html)
ps = doc.selects('p').contains(['<strong>test</strong>','A','B'])
print (ps)

Result:

[{'tag': 'p', 'html': '<strong>test</strong>TEXT A B '}]

You can also run the following code to see what it outputs.

print (doc.selects('p').containsOr(['<strong>test</strong>','<strong>ok</strong>']))
print (doc.selects('p').notContains(['<strong>test</strong>','<strong>ok</strong>']))

Upvotes: 1

Mathias Müller

Reputation: 22617

A correct XPath expression given your requirements is

//p[contains(., 'A') and contains(., 'B') and strong/text() = 'test']

Python output

>>> root.xpath("//p[contains(., 'A') and contains(., 'B') and strong/text() = 'test']")
[<Element p at 0x1075031b0>]
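The output above only shows an element object. To confirm which paragraph matched, the element can be serialized. This is a minimal sketch that reuses the HTML from the question (with the closing body tag corrected); the variable name matches is only illustrative.

```python
import lxml.html

# Same markup as in the question, with the closing </body> tag fixed
text = '''\
<html>
    <body>
        <p><strong>test</strong>TEXT A B </p>
        <p><strong>test</strong>TEXT A </p>
        <p><strong>test</strong>TEXT B </p>
        <p><strong>ok</strong>TEXT A B </p>
        <p>TEXT A B </p>
    </body>
</html>'''
root = lxml.html.fromstring(text)

# The expression from this answer
matches = root.xpath(
    "//p[contains(., 'A') and contains(., 'B') and strong/text() = 'test']"
)

# Serialize each matched element without its trailing whitespace (tail)
for p in matches:
    print(lxml.html.tostring(p, encoding='unicode', with_tail=False))
```

Only the first paragraph should be printed, since it is the only one whose strong child is test and whose string value contains both A and B.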

Problem with your proposed approaches

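Your first attempt does not include all conditions (the check on the text content of strong is missing), while the second one uses strong(...), which is not valid XPath syntax (you probably meant a predicate, strong[...]). The leading / would also make that path absolute instead of relative to p.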

Your second proposed approach can be amended with minimal changes, with the same output:

>>> root.xpath('//p[strong[contains(text(),"test")] and contains(text(),"A") and contains(text(),"B")]')
[<Element p at 0x1075031b0>]

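The difference from my solution above is that I test the string value of the element (.), which includes the text of all descendants, while your solution tests text(), the text nodes that are direct children of p.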
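The two forms give the same result on the question's data, but they can diverge on other markup. In XPath 1.0, when contains() receives a node-set such as text(), it uses only the first node. The sketch below uses a hypothetical paragraph (not from the question) in which "A B" comes after the strong child, to show where the two approaches differ.

```python
import lxml.html

# Hypothetical sample: the first text node of p is "TEXT ", and "A B" follows <strong>
snippet = '<div><p>TEXT <strong>test</strong> A B</p></div>'
root = lxml.html.fromstring(snippet)

# text() is a node-set; contains() looks only at its first node, "TEXT "
via_text = root.xpath('//p[contains(text(), "A") and contains(text(), "B")]')

# . is the string value of p, i.e. "TEXT test A B"
via_dot = root.xpath('//p[contains(., "A") and contains(., "B")]')

print(len(via_text), len(via_dot))
```

Note that the string value also includes the text of strong itself, so contains(., 'A') would match a paragraph where A appears only inside strong. Which form is appropriate depends on the markup being queried.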

Upvotes: 0
