Reputation: 3068
Hi, I need help with the following code to navigate to and obtain the data from the remaining pages of the link in start_urls. Please help.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from texashealth.items import TexashealthItem  # project items module (path assumed)

class TexasHealthSpider(CrawlSpider):
    name = "texashealth2"
    allowed_domains = ['www.texashealth.org']
    start_urls = ['http://jobs.texashealth.org/search/']

    rules = (
        Rule(SgmlLinkExtractor(allow=(r"startrow=\d",)), callback="parse", follow=True),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//tbody/tr/td')
        items = []
        for title in titles:  # loop variable renamed so it no longer shadows the list
            item = TexashealthItem()
            item['title'] = title.select('span[@class="jobTitle"]/a/text()').extract()
            item['link'] = title.select('span[@class="jobTitle"]/a/@href').extract()
            item['shifttype'] = title.select('span[@class="jobShiftType"]/text()').extract()
            item['location'] = title.select('span[@class="jobLocation"]/text()').extract()
            items.append(item)
        print items
        return items
Upvotes: 0
Views: 1125
Reputation: 11396
Remove the restriction in allowed_domains=['www.texashealth.org']: make it allowed_domains=['texashealth.org'] or allowed_domains=['jobs.texashealth.org']. Otherwise no page will be crawled, since the spider's requests go to jobs.texashealth.org, which the offsite middleware filters out when only www.texashealth.org is allowed.
Btw, consider changing the callback name; from the docs:
Warning
When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
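Putting both fixes together, here is a minimal sketch of the corrected spider (the callback name parse_items is my own choice, not from your code; any name other than parse works):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class TexasHealthSpider(CrawlSpider):
    name = "texashealth2"
    # widened so requests to jobs.texashealth.org pass the offsite filter
    allowed_domains = ['texashealth.org']
    start_urls = ['http://jobs.texashealth.org/search/']

    rules = (
        # callback renamed so CrawlSpider's built-in parse() stays intact
        Rule(SgmlLinkExtractor(allow=(r"startrow=\d",)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        for row in hxs.select('//tbody/tr/td'):
            # extract the fields exactly as in your current parse()
            pass

With those two changes, CrawlSpider's own parse() keeps driving the rules, and the startrow=... pagination links on jobs.texashealth.org are followed page after page.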
Upvotes: 1