dythe

Reputation: 841

Scrapy ignoring start page and continuing to next page

I have a Scrapy spider that I'm trying to paginate with, but every time I start the crawl it seems to skip the start page (page 1) and go straight to page 2.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class IT(CrawlSpider):
    name = 'IT'

    allowed_domains = ["jobscentral.com.sg"]
    start_urls = [
        'https://jobscentral.com.sg/jobs-accounting',
    ]

    rules = (
        Rule(LinkExtractor(allow_domains=("jobscentral.com.sg", ),
                           restrict_xpaths=('//li[@class="page-item"]/a[@aria-label="Next"]',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.logger.info("Response %d for %r" % (response.status, response.url))
        #self.logger.info("base url %s", get_base_url(response))
        items = []
        self.logger.info("Visited Outer Link %s", response.url)

        for loop in response.xpath('//div[@class="col-md-11"]'):
            item = JobsItems()
            t = loop.xpath('./div[@class="col-xs-12 col-md-3 px-0"]/div[@class="posted-date text-muted hidden-sm-down"]//text()').extract()[1].strip()

.... 
more code here

Upvotes: 0

Views: 277

Answers (1)

Tarun Lalwani

Reputation: 146510

Yes, that is correct: when you use start_urls, the first response goes to the parse method, which CrawlSpider defines internally to apply the crawling rules. So if you need to process the response for the start URL as well, you can override parse like below.

class IT(CrawlSpider):
    name = 'IT'

    allowed_domains = ["jobscentral.com.sg"]
    start_urls = [
        'https://jobscentral.com.sg/jobs-accounting',
    ]
    rules = (
        Rule(LinkExtractor(allow_domains=("jobscentral.com.sg", ),
                           restrict_xpaths=('//li[@class="page-item"]/a[@aria-label="Next"]',)),
             callback='parse_item', follow=True),
    )

    first_response = True

    def parse(self, response):
        if self.first_response:
            # use it or pass it to some other function
            for r in self.parse_item(response):
                yield r
            self.first_response = False

        # Pass the response on to CrawlSpider's rule handling
        for r in super(IT, self).parse(response):
            yield r

    def parse_item(self, response):
        self.logger.info("Response %d for %r" % (response.status, response.url))
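
As a side note, CrawlSpider also exposes a parse_start_url hook that is called with the responses for the start_urls, so the same effect should be achievable without overriding parse at all. A minimal sketch, reusing the spider's existing parse_item:

class IT(CrawlSpider):
    # same name, allowed_domains, start_urls and rules as above

    def parse_start_url(self, response):
        # CrawlSpider calls parse_start_url() for each start_urls response,
        # so page 1 is processed here while the rules still handle pagination
        return self.parse_item(response)

This avoids the first_response flag entirely, but either approach should get you the items from page 1.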

Upvotes: 1
