Dyl10

Reputation: 161

Can I scrape content only after a specific header?

def parse_linkpage(self, response):
    hxs = HtmlXPathSelector(response)
    item = QualificationItem()
    xpath = """
            //h2[normalize-space(.)="Entry requirements for undergraduate courses"]
             /following-sibling::p
            """
    item['Qualification'] = hxs.select(xpath).extract()[1:]
    item['Country'] = response.meta['a_of_the_link']
    return item

I was wondering whether I could get my code to stop scraping when it reaches the next <h2>, so that I only get the paragraphs under the first one.

Here is the webpage:

<h2>Entry requirements for undergraduate courses</h2>
<p>Example1</p>
<p>Example2</p>
<h2>Postgraduate Courses</h2>
<p>Example3</p>
<p>Example4</p>

I want these results:

Example1
Example2

But I get:

Example1
Example2
Example3
Example4

I know I could change this line:

item['Qualification'] = hxs.select(xpath).extract()

to:

item['Qualification'] = hxs.select(xpath).extract()[0:2]

But this scraper runs over many different pages, and some of them may have more than two paragraphs under the first header, so a fixed slice would leave that information out.

Is there a way to tell it to extract only the content that belongs to the header I want, rather than everything that follows it?

Upvotes: 2

Views: 362

Answers (2)

Joel M. Lamsen

Reputation: 7173

You could try this XPath:

//h2[normalize-space(.)="Entry requirements for undergraduate courses"]
         /following-sibling::p[not(preceding-sibling::h2[normalize-space(.)!="Entry requirements for undergraduate courses"])]

This adds a second predicate to following-sibling::p so that it skips any p that has a preceding-sibling h2 whose text is not "Entry requirements for undergraduate courses".
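
One thing to watch for: that predicate rejects every p that has any differently named h2 before it, so it seems to rely on the target h2 being the first h2 among its siblings. A variant that checks only the nearest preceding h2 would be following-sibling::p[preceding-sibling::h2[1][normalize-space(.)="Entry requirements for undergraduate courses"]]. Below is a rough plain-Python sketch of both filters, just to show the difference. It uses the standard library's ElementTree rather than Scrapy, and the page with an extra "Overview" header and all function names are hypothetical.

```python
import xml.etree.ElementTree as ET

TARGET = "Entry requirements for undergraduate courses"

# Hypothetical page: unlike the question's sample, another <h2> precedes the target.
PAGE = """<div>
<h2>Overview</h2>
<p>Intro</p>
<h2>Entry requirements for undergraduate courses</h2>
<p>Example1</p>
<p>Example2</p>
<h2>Postgraduate Courses</h2>
<p>Example3</p>
</div>"""


def heading_text(el):
    # Rough equivalent of XPath normalize-space(.)
    return " ".join("".join(el.itertext()).split())


def nearest_heading_filter(root, target):
    """Keep each <p> whose nearest preceding <h2> sibling is the target."""
    kept, current = [], None
    for el in root:  # direct children, in document order
        if el.tag == "h2":
            current = heading_text(el)
        elif el.tag == "p" and current == target:
            kept.append(el.text)
    return kept


def no_other_heading_filter(root, target):
    """Keep each <p> after the target that has no differently named <h2> before it."""
    kept, seen_target, seen_other = [], False, False
    for el in root:
        if el.tag == "h2":
            if heading_text(el) == target:
                seen_target = True
            else:
                seen_other = True
        elif el.tag == "p" and seen_target and not seen_other:
            kept.append(el.text)
    return kept


root = ET.fromstring(PAGE)
print(nearest_heading_filter(root, TARGET))
print(no_other_heading_filter(root, TARGET))
```

On the question's own sample page, where the target is the first h2, both filters should give the same two paragraphs.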

Upvotes: 0

paul trmbrth

Reputation: 20748

It's not very pretty or easy to read, but you can use the EXSLT extensions to XPath and their set:difference() operation:

>>> selector.xpath("""
    set:difference(//h2[normalize-space(.)="Entry requirements for undergraduate courses"]
                    /following-sibling::p,
                   //h2[normalize-space(.)="Entry requirements for undergraduate courses"]
                    /following-sibling::h2[1]
                    /following-sibling::p)""").extract()
[u'<p>Example1</p>', u'<p>Example2</p>']

The idea is to select all p elements following your target h2, and exclude those that come after the next h2.

A somewhat easier-to-read version:

>>> for h2 in selector.xpath('//h2[normalize-space(.)="Entry requirements for undergraduate courses"]'):
...     paragraphs = h2.xpath("""set:difference(./following-sibling::p,
...                                             ./following-sibling::h2[1]/following-sibling::p)""").extract()
...     print paragraphs
... 
[u'<p>Example1</p>', u'<p>Example2</p>']
>>> 
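
If it helps to see the set-difference idea outside XPath, here is a minimal sketch in plain Python. It is not the EXSLT call above: it uses the standard library's ElementTree on the question's sample wrapped in a div, and the function name is made up for illustration.

```python
import xml.etree.ElementTree as ET

TARGET = "Entry requirements for undergraduate courses"

# The question's sample page, wrapped in a <div> so it parses as one element.
PAGE = """<div>
<h2>Entry requirements for undergraduate courses</h2>
<p>Example1</p>
<p>Example2</p>
<h2>Postgraduate Courses</h2>
<p>Example3</p>
<p>Example4</p>
</div>"""


def paragraphs_by_difference(root, target):
    """All <p> after the target <h2>, minus all <p> after the next <h2>."""
    siblings = list(root)

    def norm(el):
        # Rough equivalent of XPath normalize-space(.)
        return " ".join("".join(el.itertext()).split())

    # Position of the target header among its siblings, if present.
    start = next((i for i, el in enumerate(siblings)
                  if el.tag == "h2" and norm(el) == target), None)
    if start is None:
        return []

    # First set: every <p> that follows the target header.
    after_target = [el for el in siblings[start + 1:] if el.tag == "p"]

    # Second set: every <p> that follows the next header (empty if there is none).
    next_h2 = next((i for i in range(start + 1, len(siblings))
                    if siblings[i].tag == "h2"), None)
    after_next = [] if next_h2 is None else [
        el for el in siblings[next_h2 + 1:] if el.tag == "p"]

    # Difference by element identity, keeping document order.
    excluded = {id(el) for el in after_next}
    return [el.text for el in after_target if id(el) not in excluded]


print(paragraphs_by_difference(ET.fromstring(PAGE), TARGET))
```

When the target is the last header on the page, the second set is empty, so every following paragraph is kept.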

Upvotes: 2
