Reputation: 12302
I have the following HTML to scrape:
<h2>
<span id="title">Title</span>
</h2>
<p>Content 1</p>
<p>Content 2</p>
<p>Content 3</p>
<p>Content 4</p>
<h2>Some other header</h2>
<p>Do not want this content</p>
What I want to select is the series of four <p> tags after the title, ignoring everything else as soon as a non-<p> tag is encountered.
So far my XPath is //h2[span[@id='title']]/following-sibling::p, but this also includes the unwanted <p> tag after the second header.
I also tried the preceding-sibling approach, //p[preceding-sibling::h2[span[@id='title']]], with no luck: the extra <p> tag is still included.
Upvotes: 4
Views: 2257
Reputation: 14238
Try this XPath:
//p[preceding-sibling::h2[1][./span[@id = 'title']]]
What does this XPath do?
It selects p elements that have an h2 as a preceding sibling, but only if their nearest preceding-sibling h2 has a child span whose id attribute equals title.
Why does it filter out <p>Do not want this content</p>?
Because the preceding-sibling axis runs in reverse document order, so this p's preceding h2 elements are listed nearest first:
<h2>Some other header</h2>
<h2>
<span id="title">Title</span>
</h2>
Hence h2[1] is <h2>Some other header</h2>, the predicate h2[1][./span[@id = 'title']] is false, and this p is not returned.
The result on an example XML document:
<root>
<h2>
<span id="title">Title</span>
</h2>
<p>Content 1</p>
<p>Content 2</p>
<p>Content 3</p>
<p>Content 4</p>
<h2>Some other header</h2>
<p>Do not want this content</p>
<p>Do not want this content too</p>
</root>
is:
'<p>Content 1</p>'
'<p>Content 2</p>'
'<p>Content 3</p>'
'<p>Content 4</p>'
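For readers who want to see the selection rule spelled out, here is a minimal sketch in plain Python. It uses the standard library's xml.etree.ElementTree, whose limited XPath support has no preceding-sibling axis, so the loop tracks the nearest preceding h2 by hand. The function name paragraphs_under_title and its title_id parameter are made up for illustration; they are not part of any library.

```python
import xml.etree.ElementTree as ET

XML = """<root>
<h2>
<span id="title">Title</span>
</h2>
<p>Content 1</p>
<p>Content 2</p>
<p>Content 3</p>
<p>Content 4</p>
<h2>Some other header</h2>
<p>Do not want this content</p>
<p>Do not want this content too</p>
</root>"""


def paragraphs_under_title(root, title_id="title"):
    """Return <p> children whose nearest preceding <h2> sibling
    has a child <span> with the given id (hypothetical helper)."""
    selected = []
    under_title = False  # does the most recent <h2> match?
    for child in root:
        if child.tag == "h2":
            # Each <h2> replaces the previous one as the "nearest" header.
            under_title = child.find(f"span[@id='{title_id}']") is not None
        elif child.tag == "p" and under_title:
            selected.append(child)
    return selected


result = [p.text for p in paragraphs_under_title(ET.fromstring(XML))]
print(result)  # ['Content 1', 'Content 2', 'Content 3', 'Content 4']
```

Like the XPath, this sketch only resets at an h2. A different intervening tag, such as a div, would not stop the selection, so if the page can contain other elements between the paragraphs the rule would need tightening.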
Upvotes: 9
Reputation: 664
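I suggest using BeautifulSoup, walking the siblings after the title's <h2> and stopping at the first tag that is not a <p>:
from bs4 import BeautifulSoup

# body is the HTML string to scrape
soup = BeautifulSoup(body, 'html.parser')
p_list = []
# Walk the siblings that follow the <h2> containing span#title
for sibling in soup.find('span', {'id': 'title'}).parent.next_siblings:
    if sibling.name is None:
        continue  # skip whitespace text between tags
    if sibling.name != 'p':
        break  # stop at the first non-<p> tag
    p_list.append(sibling)
print(p_list)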
Upvotes: 2