Abhishek

Reputation: 3068

scrapy navigating to next pages listed in the first crawl page

Hi, I need help with the following code. I want it to navigate to and obtain the data from the remaining pages listed on the page in start_urls. Please help.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from texashealth.items import TexashealthItem  # item class from my project's items.py


class texashealthspider(CrawlSpider):

    name = "texashealth2"
    allowed_domains = ['www.texashealth.org']
    start_urls = ['http://jobs.texashealth.org/search/']

    rules = (
        Rule(SgmlLinkExtractor(allow=("startrow=\d",)), callback="parse", follow=True),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//tbody/tr/td')
        items = []

        for title in titles:
            item = TexashealthItem()
            item['title'] = title.select('span[@class="jobTitle"]/a/text()').extract()
            item['link'] = title.select('span[@class="jobTitle"]/a/@href').extract()
            item['shifttype'] = title.select('span[@class="jobShiftType"]/text()').extract()
            item['location'] = title.select('span[@class="jobLocation"]/text()').extract()
            items.append(item)
        print items
        return items

Upvotes: 0

Views: 1125

Answers (1)

Guy Gavriely

Reputation: 11396

Remove the `www.` restriction in allowed_domains=['www.texashealth.org']: make it allowed_domains=['texashealth.org'] or allowed_domains=['jobs.texashealth.org']. Otherwise no pages will be crawled, because the links on jobs.texashealth.org fall outside the allowed domain.

By the way, consider renaming the callback function; from the docs:

Warning

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.

Upvotes: 1
