Reputation: 841
I have a Scrapy spider that I'm trying to paginate with, but every time I start the crawl it seems to skip the start page (page 1) and go straight to page 2.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class IT(CrawlSpider):
    name = 'IT'
    allowed_domains = ["jobscentral.com.sg"]
    start_urls = [
        'https://jobscentral.com.sg/jobs-accounting',
    ]

    rules = (
        Rule(LinkExtractor(allow_domains=("jobscentral.com.sg",),
                           restrict_xpaths=('//li[@class="page-item"]/a[@aria-label="Next"]',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.logger.info("Response %d for %r" % (response.status, response.url))
        #self.logger.info("base url %s", get_base_url(response))
        items = []
        self.logger.info("Visited Outer Link %s", response.url)
        for loop in response.xpath('//div[@class="col-md-11"]'):
            item = JobsItems()  # JobsItems is defined in the project's items module
            t = loop.xpath('./div[@class="col-xs-12 col-md-3 px-0"]/div[@class="posted-date text-muted hidden-sm-down"]//text()').extract()[1].strip()
            ....
            more codes here
Upvotes: 0
Views: 277
Reputation: 146510
Yes, that is correct: when you use start_urls, the first response goes to the parse method. That method is defined internally by CrawlSpider to execute the crawling rules, which is why your parse_item callback never sees page 1. If you need to process that first response as well, you can use something like the following:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class IT(CrawlSpider):
    name = 'IT'
    allowed_domains = ["jobscentral.com.sg"]
    start_urls = [
        'https://jobscentral.com.sg/jobs-accounting',
    ]

    rules = (
        Rule(LinkExtractor(allow_domains=("jobscentral.com.sg",),
                           restrict_xpaths=('//li[@class="page-item"]/a[@aria-label="Next"]',)),
             callback='parse_item', follow=True),
    )

    first_response = True

    def parse(self, response):
        if self.first_response:
            # use the first page here, or pass it to some other function
            for r in self.parse_item(response):
                yield r
            self.first_response = False
        # pass the response on to CrawlSpider's own parse so the rules still run
        for r in super(IT, self).parse(response):
            yield r

    def parse_item(self, response):
        self.logger.info("Response %d for %r" % (response.status, response.url))
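Alternatively, CrawlSpider provides a parse_start_url hook that is called for each response coming from start_urls, so you can override that instead of parse and leave the rule machinery untouched. A minimal sketch, assuming parse_item is the same generator as in your spider:

class IT(CrawlSpider):
    # ... same name / allowed_domains / start_urls / rules as above ...

    def parse_start_url(self, response):
        # CrawlSpider calls this for responses from start_urls, so page 1
        # gets scraped too; the rules still handle the pagination links.
        return self.parse_item(response)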
Upvotes: 1