Reputation: 574
I'm crawling the site https://oa.mo.gov/personnel/classification-specifications/all. I need to get to each position page and then extract some information. I figure I could do this with a LinkExtractor or by finding all the URLs with XPath, which is what I'm attempting below. The spider doesn't show any errors, but also doesn't crawl any pages:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from StateOfMoJDs.items import StateOfMoJDs

class StateOfMoJDs(scrapy.Spider):
    name = 'StateOfMoJDs'
    allowed_domains = ['oa.mo.gov']
    start_urls = ['https://oa.mo.gov/personnel/classification-specifications/all']

    def parse(self, response):
        for url in response.xpath('//span[@class="field-content"]/a/@href').extract():
            url2 = 'https://oa.mo.gov' + url
            scrapy.Request(url2, callback=self.parse_job)

    def parse_job(self, response):
        item = StateOfMoJDs()
        item["url"] = response.url
        item["jobtitle"] = response.xpath('//span[@class="page-title"]/text()').extract()
        item["salaryrange"] = response.xpath('//*[@id="class-spec-compact"]/div/div[1]/div[2]/div[1]/div[2]/div/text()').extract()
        item["classnumber"] = response.xpath('//*[@id="class-spec-compact"]/div/div[1]/div[1]/div[1]/div/div[2]/div//text()').extract()
        item["paygrade"] = response.xpath('//*[@id="class-spec-compact"]/div/div[1]/div[3]/div/div[2]/div//text()').extract()
        item["definition"] = response.xpath('//*[@id="class-spec-compact"]/div/div[2]/div[1]/div[2]/div/p//text()').extract()
        item["jobduties"] = response.xpath('//*[@id="class-spec-compact"]/div/div[2]/div[2]/div[2]/div/div//text()').extract()
        item["basicqual"] = response.xpath('//*[@id="class-spec-compact"]/div/div[3]/div[1]/div/div//text()').extract()
        item["specialqual"] = response.xpath('//*[@id="class-spec-compact"]/div/div[3]/div[2]/div[2]/div//text()').extract()
        item["keyskills"] = response.xpath('//*[@id="class-spec-compact"]/div/div[4]/div/div[2]/div/div//text()').extract()
        yield item
When using scrapy shell, response.xpath('//span[@class="field-content"]/a/@href').extract()
yields a list of relative URLs:
['/personnel/classification-specifications/3005', '/personnel/classification-specifications/3006', '/personnel/classification-specifications/3007', ...]
Upvotes: 2
Views: 61
Reputation: 8192
In your parse() method you need to yield your requests:
yield scrapy.Request(url2, callback=self.parse_job)
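The reason this matters is that Scrapy treats parse() as a generator and schedules whatever it yields; a request that is merely constructed and never yielded is discarded. A minimal sketch of the same pattern with plain Python (no Scrapy involved, strings standing in for Request objects):

```python
def parse_without_yield(urls):
    # Mirrors the bug: each "request" is built inside the loop,
    # then immediately dropped because nothing is yielded or returned.
    for url in urls:
        'https://oa.mo.gov' + url  # built, never handed to the caller

def parse_with_yield(urls):
    # Yielding hands each "request" back to the caller (in Scrapy,
    # the engine), which is what makes the follow-up crawl happen.
    for url in urls:
        yield 'https://oa.mo.gov' + url

urls = ['/personnel/classification-specifications/3005']
print(parse_without_yield(urls))        # None - nothing to schedule
print(list(parse_with_yield(urls)))     # the absolute URL, ready to crawl
```
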
Upvotes: 2