Khanetor

Reputation: 12302

Select sequence of next siblings in Scrapy

I have the following HTML to scrape:

<h2>
  <span id="title">Title</span>
</h2>
<p>Content 1</p>
<p>Content 2</p>
<p>Content 3</p>
<p>Content 4</p>
<h2>Some other header</h2>
<p>Do not want this content</p>

I want to select the run of 4 <p> tags that follow the title, and stop as soon as a non-<p> tag is encountered.

So far my XPath is //h2[span[@id='title']]/following-sibling::p, but this also includes the unwanted <p> tags.

I also tried the preceding-sibling approach, with no luck: //p[preceding-sibling::h2[span[@id='title']]]. The extra <p> tag is still included.

Upvotes: 4

Views: 2257

Answers (2)

SomeDude

Reputation: 14238

Try this XPath:

//p[preceding-sibling::h2[1][./span[@id = 'title']]]

What this XPath does: it selects p elements whose nearest preceding-sibling h2 (that is, h2[1] on the preceding-sibling axis) has a child span whose id attribute equals title.

Why is <p>Do not want this content</p> filtered out? Because preceding-sibling is a reverse axis, this p's preceding h2 elements are numbered starting from the nearest one:

<h2>Some other header</h2>

<h2> <span id="title">Title</span> </h2>

Hence h2[1] is <h2>Some other header</h2>, the predicate [./span[@id = 'title']] is false for it, and this p is not returned.

The result on an example XML document:

<root>
<h2>
  <span id="title">Title</span>
</h2>
<p>Content 1</p>
<p>Content 2</p>
<p>Content 3</p>
<p>Content 4</p>
<h2>Some other header</h2>
<p>Do not want this content</p>
<p>Do not want this content too</p>
</root>

is:

'<p>Content 1</p>'
'<p>Content 2</p>'
'<p>Content 3</p>'
'<p>Content 4</p>'
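
To check the expression outside a spider, here is a minimal sketch that runs both the question's XPath and the one above against the sample document. It uses lxml, which is my choice for the illustration and not something the answer specifies; the names XML, too_many, wanted and texts are hypothetical. In Scrapy the same expression string can be passed to response.xpath(...).

```python
from lxml import etree

# Sample document from the answer above.
XML = """<root>
<h2>
  <span id="title">Title</span>
</h2>
<p>Content 1</p>
<p>Content 2</p>
<p>Content 3</p>
<p>Content 4</p>
<h2>Some other header</h2>
<p>Do not want this content</p>
<p>Do not want this content too</p>
</root>"""

root = etree.fromstring(XML)

# The question's attempt: every later <p> sibling of the title <h2> matches.
too_many = root.xpath("//h2[span[@id='title']]/following-sibling::p")

# The answer's expression: keep a <p> only if its nearest preceding
# <h2> sibling is the one containing span#title.
wanted = root.xpath("//p[preceding-sibling::h2[1][./span[@id='title']]]")

texts = [p.text for p in wanted]
print(len(too_many))  # 6
print(texts)          # ['Content 1', 'Content 2', 'Content 3', 'Content 4']
```

Note that this expression treats the next h2 as the boundary. If some other non-<p> element (a div, say) sat between the paragraphs, the paragraphs after it would still match as long as no new h2 had appeared.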

Upvotes: 9

CK Chen

Reputation: 664

I suggest using BeautifulSoup.

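from bs4 import BeautifulSoup, Tag

# `body` is assumed to hold the HTML source (for example response.body in Scrapy).
soup = BeautifulSoup(body, 'html.parser')
p_list = []
for sibling in soup.find('span', {'id': 'title'}).parent.next_siblings:
    if not isinstance(sibling, Tag):
        continue  # skip whitespace and other text nodes between tags
    if sibling.name != 'p':
        break  # stop at the first non-<p> tag
    p_list.append(sibling)
print(p_list)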

Upvotes: 2
