Reputation: 161
def parse_linkpage(self, response):
    hxs = HtmlXPathSelector(response)
    item = QualificationItem()
    xpath = """
        //h2[normalize-space(.)="Entry requirements for undergraduate courses"]
        /following-sibling::p
        """
    item['Qualification'] = hxs.select(xpath).extract()[1:]
    item['Country'] = response.meta['a_of_the_link']
    return item
So I was wondering if I could get my code to stop scraping once it reaches the end of that <h2> section, i.e. at the next <h2>.
Here is the webpage:
<h2>Entry requirements for undergraduate courses</h2>
<p>Example1</p>
<p>Example2</p>
<h2>Postgraduate Courses</h2>
<p>Example3</p>
<p>Example4</p>
I want these results:
Example1
Example2
But I get:
Example1
Example2
Example3
Example4
I know I could change this line,
item['Qualification'] = hxs.select(xpath).extract()
to,
item['Qualification'] = hxs.select(xpath).extract()[0:2]
But this scraper visits many different pages, some of which might have more than two paragraphs under the first header, so a fixed slice would leave that information out.
Is there a way to tell it to extract only the data that follows the header I want, and not everything after it?
Upvotes: 2
Views: 362
Reputation: 7173
Maybe you could use this XPath:
//h2[normalize-space(.)="Entry requirements for undergraduate courses"]
/following-sibling::p[not(preceding-sibling::h2[normalize-space(.)!="Entry requirements for undergraduate courses"])]
You can add another predicate to following-sibling::p so that it does not include any p that has a preceding-sibling h2 whose text is not "Entry requirements for undergraduate courses".
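For what it's worth, here is a rough way to try this outside Scrapy, assuming lxml is installed (the names PAGE, TARGET and texts are made up for this sketch). One thing to watch: the predicate above rejects every p that has any differently named h2 before it, so it returns nothing when another h2 comes before the target heading. The second expression is my own variant, not part of the answer: it keeps a p only when its nearest preceding h2 sibling is the target.

```python
from lxml import html

# Sample markup from the question, wrapped in a <div> so the nodes are siblings.
PAGE = """
<div>
  <h2>Entry requirements for undergraduate courses</h2>
  <p>Example1</p>
  <p>Example2</p>
  <h2>Postgraduate Courses</h2>
  <p>Example3</p>
  <p>Example4</p>
</div>
"""

# Same page, but with an unrelated heading placed before the target one.
PAGE_WITH_EARLIER_HEADING = PAGE.replace(
    "<div>", "<div><h2>Overview</h2><p>Intro</p>", 1
)

TARGET = "Entry requirements for undergraduate courses"

# XPath from the answer: drop any <p> that has a different <h2> before it.
exclude_other_headings = (
    '//h2[normalize-space(.)="%s"]'
    '/following-sibling::p[not(preceding-sibling::h2[normalize-space(.)!="%s"])]'
) % (TARGET, TARGET)

# Variant (an assumption, not from the answer): keep a <p> only when its
# nearest preceding <h2> sibling is the target heading.
nearest_heading_is_target = (
    '//h2[normalize-space(.)="%s"]'
    '/following-sibling::p[preceding-sibling::h2[1][normalize-space(.)="%s"]]'
) % (TARGET, TARGET)


def texts(markup, xpath):
    """Parse the markup and return the text of every node the XPath selects."""
    tree = html.fromstring(markup)
    return [node.text_content() for node in tree.xpath(xpath)]


answer_on_question_page = texts(PAGE, exclude_other_headings)
variant_on_question_page = texts(PAGE, nearest_heading_is_target)
answer_on_other_page = texts(PAGE_WITH_EARLIER_HEADING, exclude_other_headings)
variant_on_other_page = texts(PAGE_WITH_EARLIER_HEADING, nearest_heading_is_target)

print(answer_on_question_page)
print(variant_on_question_page)
print(answer_on_other_page)
print(variant_on_other_page)
```

On the page from the question both expressions give the two wanted paragraphs. They only differ once some other h2 appears earlier among the siblings.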
Upvotes: 0
Reputation: 20748
It's not very pretty or easy to read, but you can use the EXSLT extensions to XPath and their set:difference() operation:
>>> selector.xpath("""
...     set:difference(
...         //h2[normalize-space(.)="Entry requirements for undergraduate courses"]
...             /following-sibling::p,
...         //h2[normalize-space(.)="Entry requirements for undergraduate courses"]
...             /following-sibling::h2[1]
...             /following-sibling::p)""").extract()
[u'<p>Example1</p>', u'<p>Example2</p>']
The idea is to select all p elements following your target h2, and exclude those p elements that come after the next h2.
Here is a version that is a bit easier to read:
>>> for h2 in selector.xpath('//h2[normalize-space(.)="Entry requirements for undergraduate courses"]'):
...     paragraphs = h2.xpath("""set:difference(./following-sibling::p,
...                              ./following-sibling::h2[1]/following-sibling::p)""").extract()
...     print paragraphs
...
[u'<p>Example1</p>', u'<p>Example2</p>']
>>>
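If you would rather keep the XPath simple, the same cut-off can also be done in plain Python: walk the siblings in document order, start collecting at the target h2, and stop at the next h2. This is only a sketch of that idea, not the set:difference() approach above. It uses the standard library's xml.etree.ElementTree on the sample markup from the question (which happens to be well-formed XML; real pages often are not, so in a spider you would loop over selector nodes instead). The helper name paragraphs_under and the wrapping div are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, wrapped in a <div> to give it a single root.
PAGE = """<div>
<h2>Entry requirements for undergraduate courses</h2>
<p>Example1</p>
<p>Example2</p>
<h2>Postgraduate Courses</h2>
<p>Example3</p>
<p>Example4</p>
</div>"""


def paragraphs_under(parent, heading_text):
    """Return the text of the <p> children between the matching <h2> and the next <h2>."""
    collected = []
    inside_section = False
    for child in parent:
        if child.tag == "h2":
            if inside_section:
                # The next heading ends the section we were collecting.
                break
            # Collapse whitespace, roughly like XPath's normalize-space().
            inside_section = " ".join((child.text or "").split()) == heading_text
        elif inside_section and child.tag == "p":
            collected.append(child.text)
    return collected


root = ET.fromstring(PAGE)
qualification = paragraphs_under(root, "Entry requirements for undergraduate courses")
print(qualification)
```

This copes with any number of paragraphs under the heading, which a fixed slice such as [0:2] does not.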
Upvotes: 2