Reputation: 841
I am trying to make Scrapy traverse to the next page and continue crawling, but it simply stops once the crawler reaches the end of the page. Here is a snippet of my code:
class IT(CrawlSpider):
    name = 'IT'
    allowed_domains = ["www.jobstreet.com.sg"]
    start_urls = [
        'https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?ojs=10&key=it',
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//*[@id="page_next"]',)), callback="self.parse", follow=True),
    )
Any idea how to make it work? Previously I used the snippet below and the traversal worked, but it always stopped at page 7:
next_page = response.xpath('//*[(@id = "page_next")]/@href')
if next_page:
    url = response.urljoin(next_page[0].extract())
    yield scrapy.Request(url, self.parse)
EDIT:
2017-09-09 15:48:35 [scrapy.core.engine] INFO: Closing spider (finished)
2017-09-09 15:48:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 16626,
'downloader/request_count': 20,
'downloader/request_method_count/GET': 20,
'downloader/response_bytes': 197475,
'downloader/response_count': 20,
'downloader/response_status_count/200': 20,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 9, 9, 7, 48, 35, 66598),
'item_scraped_count': 19,
'log_count/DEBUG': 40,
'log_count/INFO': 28,
'log_count/WARNING': 2,
'memusage/max': 45748224,
'memusage/startup': 32595968,
'request_depth_max': 1,
'response_received_count': 20,
'scheduler/dequeued': 20,
'scheduler/dequeued/memory': 20,
'scheduler/enqueued': 20,
'scheduler/enqueued/memory': 20,
'start_time': datetime.datetime(2017, 9, 9, 7, 47, 1, 843551)}
2017-09-09 15:48:35 [scrapy.core.engine] INFO: Spider closed (finished)
Exception twisted._threads._ithreads.AlreadyQuit: AlreadyQuit() in <bound method JobstreetPipeline.__del__ of <jobstreet.pipelines.JobstreetPipeline object at 0x103c152d0>> ignored
Current Code:
class IT(CrawlSpider):
    name = 'IT'
    allowed_domains = ["www.jobstreet.com.sg"]
    start_urls = [
        'https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?ojs=10&key=it',
    ]

    rules = (
        # Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//*[@id="page_next"]',)), callback="self.parse", follow=True),
        Rule(LinkExtractor(allow_domains=("jobstreet.com.sg", ), restrict_xpaths=('//*[@id="page_next"]',)), callback='parse', follow=True),
    )

    def parse(self, response):
        items = []
        ...
        item = JobsItems()
        ...
        item['jobdetailsurl'] = sel.xpath('.//a[@class="position-title-link"]/@href').extract()[0]
        request = scrapy.Request(item['jobdetailsurl'], callback=self.parse_jobdetails)
        request.meta['item'] = item
        yield request
Upvotes: 0
Views: 280
Reputation: 146510
A few issues. SgmlLinkExtractor is deprecated; you should use LinkExtractor instead. Also, callback="self.parse" should be just the function name as a string, without self. Finally, CrawlSpider uses the parse method internally to apply its rules, so if you want to extract data from the response you should use a separately named callback:
class IT(CrawlSpider):
    name = 'IT'
    allowed_domains = ["www.jobstreet.com.sg"]
    start_urls = [
        'https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?ojs=10&key=it',
    ]

    rules = (
        Rule(LinkExtractor(allow_domains=("jobstreet.com.sg", ), restrict_xpaths=('//*[@id="page_next"]',)), callback='parse_response', follow=True),
    )

    def parse_response(self, response):
        yield {"page": response.url}
Edit-2
I added response.body to the yielded item as well, and found the following:
<script src="https://www.google.com/recaptcha/api.js?hl="></script>\n\n </head>\n\n\t<body>\n\n <div id="main" class="main">\n <div class="captchaDiv">\n\t\t\t\t<div id="headerDiv" class="headerDiv">\n\t\t\t\t\t<div id="insideHeaderDiv" class="insideHeaderDiv"></div>\n\t\t\t\t</div>\n\t\t\t\t<div class="separator"></div>\n\t\t\t\t<div id="mainDiv" class="mainDiv">\n\t\t\t\t\t<div id="titleText" class="titleText">Security Check</div>\n\t\t\t\t\t<div id="instructionsText" class="instructionsText">Before we allow your access to this page, we need to confirm if you are a human (it\'s a spam prevention thing)</div>\n\t\t\t\t\t<div class="g-recaptcha" data-sitekey=\'6LcX6A4UAAAAAKK1WiuMtXOj6Ib-lXZwVaWGvkq6\' data-callback=\'mprv_captcha_submitUserInput\'></div>\n\t\t\t\t\t<div id="footerText" class="footerText">SPAM Prevention</div>\n\t\t\t\t</div>\n </div>\n </div>\n\t</body>\n</html>'
So the page shows a captcha after a few requests, and that is where the scraping stops. You need to work around that by slowing down your requests, solving the captcha somehow, or using a proxy service like Crawlera.
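If you want to try the "slow down requests" route, a minimal sketch is to throttle the spider through its custom_settings; the numbers below are assumptions to tune, not values verified against this site:

class IT(CrawlSpider):
    name = 'IT'
    # allowed_domains, start_urls and rules as above ...

    # Throttle the crawl so the captcha is less likely to trigger.
    # These values are guesses; adjust them for the target site.
    custom_settings = {
        'DOWNLOAD_DELAY': 5,                  # wait ~5s between requests
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,  # one request at a time
        'AUTOTHROTTLE_ENABLED': True,         # back off automatically under load
        'AUTOTHROTTLE_START_DELAY': 5,
        'AUTOTHROTTLE_MAX_DELAY': 60,
    }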
Edit-1
Output from the crawl
2017-09-09 13:10:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?key=it&area=1&option=1&job-source=1%2C64&classified=1&job-posted=0&sort=2&order=0&pg=16&src=16&srcr=16> (referer: https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?key=it&area=1&option=1&job-source=1%2C64&classified=1&job-posted=0&sort=2&order=0&pg=15&src=16&srcr=16)
2017-09-09 13:10:06 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?key=it&area=1&option=1&job-source=1%2C64&classified=1&job-posted=0&sort=2&order=0&pg=16&src=16&srcr=16>
{'page': 'https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?key=it&area=1&option=1&job-source=1%2C64&classified=1&job-posted=0&sort=2&order=0&pg=16&src=16&srcr=16'}
2017-09-09 13:10:06 [scrapy.core.engine] INFO: Closing spider (finished)
2017-09-09 13:10:06 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 15664,
'downloader/request_count': 16,
'downloader/request_method_count/GET': 16,
'downloader/response_bytes': 306524,
'downloader/response_count': 16,
'downloader/response_status_count/200': 16,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 9, 9, 7, 40, 6, 294685),
'item_scraped_count': 15,
'log_count/DEBUG': 31,
'log_count/INFO': 7,
'memusage/max': 49135616,
'memusage/startup': 49135616,
'request_depth_max': 15,
'response_received_count': 16,
'scheduler/dequeued': 16,
'scheduler/dequeued/memory': 16,
'scheduler/enqueued': 16,
'scheduler/enqueued/memory': 16,
'start_time': datetime.datetime(2017, 9, 9, 7, 39, 56, 380899)}
2017-09-09 13:10:06 [scrapy.core.engine] INFO: Spider closed (finished)
Upvotes: 1
Reputation: 1544
You can use the following code for your pagination:
next_page = response.xpath('//*[@id="page_next"]/@href').extract_first()
if next_page:
    yield scrapy.Request(response.urljoin(next_page), self.parse)
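For context, here is a minimal self-contained sketch of that pattern in a plain scrapy.Spider; the spider name and the yielded fields are placeholders, not the asker's real ones:

import scrapy

class JobListingSpider(scrapy.Spider):
    name = 'job_listings'  # placeholder name
    allowed_domains = ['www.jobstreet.com.sg']
    start_urls = [
        'https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?ojs=10&key=it',
    ]

    def parse(self, response):
        # extract whatever you need from the current listing page
        yield {'page': response.url}

        # follow the "next page" link until it no longer exists
        next_page = response.xpath('//*[@id="page_next"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)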
Upvotes: 0