Select sequence of next siblings in Scrapy

Question

I have the following HTML to scrape:


    <h2>
      <span id="title">Title</span>
    </h2>
    <p>Content 1</p>
    <p>Content 2</p>
    <p>Content 3</p>
    <p>Content 4</p>
    <h2>Some other header</h2>
    <p>Do not want this content</p>

What I want to select is the series of 4 `<p>` tags after the title, ignoring everything else as soon as a non-`<p>` tag is encountered.

So far my XPath is `//h2[span[@id='title']]/following-sibling::p`, but this also includes the unwanted `<p>` tags.

I also tried the preceding-sibling approach, `//p[preceding-sibling::h2[span[@id='title']]]`, with no luck. The extra `<p>` tag is still included.

SomeDude · Accepted Answer

Try this XPath:

//p[preceding-sibling::h2[1][./span[@id = 'title']]]

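What does this XPath do? It selects `p` elements that have `h2` elements as preceding siblings, but on one condition: their first preceding-sibling `h2` must have a child `span` whose `id` attribute equals `title`. Because `preceding-sibling` is a reverse axis, "first" here means the `h2` nearest to the `p`, not the first one in the document.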

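Why is `<p>Do not want this content</p>` filtered out? Because this `p`'s preceding `h2` siblings, counted from the nearest one backwards, appear in this order:

`<h2>Some other header</h2>`

`<h2><span id="title">Title</span></h2>`

Hence `h2[1]` is the "Some other header" heading, which has no `span` with `id` equal to `title`, so `h2[1][./span[@id = 'title']]` evaluates to false and this `p` is not returned.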
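To make the reverse-axis ordering concrete, here is a minimal sketch that prints the nearest preceding `h2` for each `p`. It uses `lxml` rather than Scrapy itself, and the markup is reconstructed from the question; the `<root>` wrapper and the `nearest_h2` helper are assumptions added for illustration only.

```python
from lxml import etree

# Markup reconstructed from the question. The <root> wrapper is an assumption,
# added only so the fragment parses as well-formed XML.
DOC = """
<root>
  <h2><span id="title">Title</span></h2>
  <p>Content 1</p>
  <p>Content 2</p>
  <p>Content 3</p>
  <p>Content 4</p>
  <h2>Some other header</h2>
  <p>Do not want this content</p>
</root>
"""

root = etree.fromstring(DOC)


def nearest_h2(p):
    """Return the text of the closest preceding-sibling h2 of a p element."""
    # preceding-sibling is a reverse axis, so [1] picks the nearest h2.
    return p.xpath("normalize-space(preceding-sibling::h2[1])")


# Map each paragraph's text to the heading that the [1] predicate resolves to.
nearest = {p.text: nearest_h2(p) for p in root.iter("p")}

for text, header in nearest.items():
    print(f"{text!r} -> nearest h2: {header!r}")
```

Only the paragraphs whose nearest heading is the title `h2` pass the `[./span[@id = 'title']]` test; the last paragraph resolves to "Some other header" and is dropped.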

The result on an example XML fragment:

    <h2>
      <span id="title">Title</span>
    </h2>
    <p>Content 1</p>
    <p>Content 2</p>
    <p>Content 3</p>
    <p>Content 4</p>
    <h2>Some other header</h2>
    <p>Do not want this content</p>
    <p>Do not want this content too</p>

is:

'Content 1'
'Content 2'
'Content 3'
'Content 4'
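The result above can be checked with a short script. This is a sketch using `lxml` directly, with the `<root>` wrapper assumed so the example parses as XML; in a Scrapy callback the same expression string would be passed to `response.xpath(...)`. It also runs the XPath from the question for comparison.

```python
from lxml import etree

# Example document from the answer. The <root> wrapper is an assumption,
# added only so the fragment is well-formed XML.
DOC = """
<root>
  <h2>
    <span id="title">Title</span>
  </h2>
  <p>Content 1</p>
  <p>Content 2</p>
  <p>Content 3</p>
  <p>Content 4</p>
  <h2>Some other header</h2>
  <p>Do not want this content</p>
  <p>Do not want this content too</p>
</root>
"""

root = etree.fromstring(DOC)

# XPath from the accepted answer: keep a p only if its nearest preceding h2
# is the one containing <span id="title">.
ANSWER_XPATH = "//p[preceding-sibling::h2[1][./span[@id = 'title']]]"

# XPath from the question: every p that follows the title h2, with no cutoff.
QUESTION_XPATH = "//h2[span[@id='title']]/following-sibling::p"

wanted = [p.text for p in root.xpath(ANSWER_XPATH)]
too_many = [p.text for p in root.xpath(QUESTION_XPATH)]

print(wanted)
print(too_many)
```

The first list contains only the four paragraphs under the title, while the second also includes the two paragraphs under "Some other header".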
