Reputation: 3068
Hi, I need help with the following code to navigate to and obtain the data from the remaining pages of the link in start_urls. Please help.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from texashealth.items import TexashealthItem  # project items module (path assumed)

class TexasHealthSpider(CrawlSpider):
    name = "texashealth2"
    allowed_domains = ['www.texashealth.org']
    start_urls = ['http://jobs.texashealth.org/search/']

    rules = (
        Rule(SgmlLinkExtractor(allow=(r"startrow=\d",)), callback="parse", follow=True),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//tbody/tr/td')
        items = []
        for title in titles:  # loop variable renamed so it no longer shadows the list
            item = TexashealthItem()
            item['title'] = title.select('span[@class="jobTitle"]/a/text()').extract()
            item['link'] = title.select('span[@class="jobTitle"]/a/@href').extract()
            item['shifttype'] = title.select('span[@class="jobShiftType"]/text()').extract()
            item['location'] = title.select('span[@class="jobLocation"]/text()').extract()
            items.append(item)
        print items
        return items
Upvotes: 0
Views: 1125
Reputation: 11396
Remove the restriction in allowed_domains=['www.texashealth.org']: make it allowed_domains=['texashealth.org'] or allowed_domains=['jobs.texashealth.org']. Otherwise no page will be crawled, since the spider's requests go to jobs.texashealth.org, which the offsite middleware filters out when only www.texashealth.org is allowed.
Btw, consider changing the callback name; from the docs:
Warning
When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
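Putting both fixes together, here is a minimal sketch of the corrected spider (the callback name parse_items is my own choice, not from your code; any name other than parse works):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class TexasHealthSpider(CrawlSpider):
    name = "texashealth2"
    # widened so requests to jobs.texashealth.org pass the offsite filter
    allowed_domains = ['texashealth.org']
    start_urls = ['http://jobs.texashealth.org/search/']

    rules = (
        # callback renamed so CrawlSpider's built-in parse() stays intact
        Rule(SgmlLinkExtractor(allow=(r"startrow=\d",)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        for row in hxs.select('//tbody/tr/td'):
            # extract the fields exactly as in your current parse()
            pass

With those two changes, CrawlSpider's own parse() keeps driving the rules, and the startrow=... pagination links on jobs.texashealth.org are followed page after page.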
Upvotes: 1