Reputation: 338
text = '''\
<html>
<body>
<p><strong>test</strong>TEXT A B </p>
<p><strong>test</strong>TEXT A </p>
<p><strong>test</strong>TEXT B </p>
<p><strong>ok</strong>TEXT A B </p>
<p>TEXT A B </p>
</body>
</html>'''
import lxml.html
root = lxml.html.fromstring(text)
There are five p nodes in the HTML text. I want to extract only <p><strong>test</strong>TEXT A B </p>.
The conditions are:
1. The text of the p element contains both A and B.
2. The text of p's child element strong is test.
node = root.xpath('.//p[contains(text(),"A") and contains(text(),"B")]')
The expression above extracts three nodes instead of one, so I tried adding a condition on strong:
node = root.xpath('.//p[/strong(contains(text(),"test")) and contains(text(),"A") and contains(text(),"B")]')
This is an invalid XPath expression. How should it be written correctly?
Upvotes: 1
Views: 114
Reputation: 2469
Here is a solution that does not use XPath, in case you prefer it.
from simplified_scrapy import SimplifiedDoc
html = '''<html>
<body>
<p><strong>test</strong>TEXT A B </p>
<p><strong>test</strong>TEXT A </p>
<p><strong>test</strong>TEXT B </p>
<p><strong>ok</strong>TEXT A B </p>
<p>TEXT A B </p>
</body>
</html>'''
doc = SimplifiedDoc(html)
ps = doc.selects('p').contains(['<strong>test</strong>','A','B'])
print (ps)
Result:
[{'tag': 'p', 'html': '<strong>test</strong>TEXT A B '}]
You can also try the following code to see what it outputs.
print (doc.selects('p').containsOr(['<strong>test</strong>','<strong>ok</strong>']))
print (doc.selects('p').notContains(['<strong>test</strong>','<strong>ok</strong>']))
Upvotes: 1
Reputation: 22617
A correct XPath expression given your requirements is
//p[contains(., 'A') and contains(., 'B') and strong/text() = 'test']
Python output
>>> root.xpath("//p[contains(., 'A') and contains(., 'B') and strong/text() = 'test']")
[<Element p at 0x1075031b0>]
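To see which paragraph was matched, rather than just an Element object, the result can be serialized. This is a minimal sketch that assumes lxml is installed and reuses the sample HTML from the question; the names expr, matches and serialized are mine, not from the original post.

```python
import lxml.html

# Sample document from the question (with the body tag closed properly).
text = '''\
<html>
<body>
<p><strong>test</strong>TEXT A B </p>
<p><strong>test</strong>TEXT A </p>
<p><strong>test</strong>TEXT B </p>
<p><strong>ok</strong>TEXT A B </p>
<p>TEXT A B </p>
</body>
</html>'''

root = lxml.html.fromstring(text)

# p must contain A and B somewhere in its string value,
# and must have a strong child whose text is exactly "test".
expr = "//p[contains(., 'A') and contains(., 'B') and strong/text() = 'test']"
matches = root.xpath(expr)

# Serialize each match without the whitespace that follows the element.
serialized = [
    lxml.html.tostring(p, encoding="unicode", with_tail=False)
    for p in matches
]
print(serialized)
```

Only the first paragraph should satisfy all three conditions: the second and third each lack one of the letters, the fourth has a different strong text, and the fifth has no strong child at all.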
Problem with your proposed approaches
Your first solution does not include all conditions (the text content of strong is missing), while the second one uses strong() (you probably meant strong[]) and starts the path with /, which makes it absolute rather than relative to p.
Your second proposed approach can be amended with minimal changes, with the same output:
>>> root.xpath('//p[strong[contains(text(),"test")] and contains(text(),"A") and contains(text(),"B")]')
[<Element p at 0x1075031b0>]
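The difference from my solution above is that I test the string value of p (.), which includes the text of all its descendants, while your solution tests text(), which inside contains() uses only the first text node child of p.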
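The two forms give the same result on the sample document, but they can diverge when the text is split around a child element. The sketch below uses a made-up fragment (not from the question) in which B comes after the strong element, and again assumes lxml is installed.

```python
import lxml.html

# Hypothetical fragment: "A" is in the first text node of p, "B" in the second.
frag = lxml.html.fromstring(
    "<div><p>TEXT A <strong>test</strong> B</p></div>"
)

# "." is the full string value of p: "TEXT A test B".
with_dot = frag.xpath("//p[contains(., 'A') and contains(., 'B')]")

# text() is a node-set; contains() converts it to a string using only
# its first node, "TEXT A ", so the test for B fails.
with_text = frag.xpath("//p[contains(text(), 'A') and contains(text(), 'B')]")

# Testing each text node separately finds B in the second text node.
per_node = frag.xpath(
    "//p[text()[contains(., 'A')] and text()[contains(., 'B')]]"
)

print(len(with_dot), len(with_text), len(per_node))
```

Note that contains(., 'A') also looks inside descendants such as strong, so if the letters must appear in p's own text only, the per-node form is the stricter choice.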
Upvotes: 0