dythe

Reputation: 841

Scrapy unable to traverse to next page

I am trying to make Scrapy traverse to the next page so it can continue crawling, but it simply stops when the crawler reaches the end of the page. Here is a snippet of my code:

class IT(CrawlSpider):
    name = 'IT'

    allowed_domains = ["www.jobstreet.com.sg"]
    start_urls = [
        'https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?ojs=10&key=it',
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//*[@id="page_next"]',)), callback="self.parse", follow=True),
    )

Any idea how to make it work? Previously I used this snippet of code and the traversing worked, but it always stopped at page 7:

next_page = response.xpath('//*[(@id = "page_next")]/@href')

if next_page:
    url = response.urljoin(next_page[0].extract())
    yield scrapy.Request(url, self.parse)

EDIT:

2017-09-09 15:48:35 [scrapy.core.engine] INFO: Closing spider (finished)
2017-09-09 15:48:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 16626,
 'downloader/request_count': 20,
 'downloader/request_method_count/GET': 20,
 'downloader/response_bytes': 197475,
 'downloader/response_count': 20,
 'downloader/response_status_count/200': 20,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 9, 9, 7, 48, 35, 66598),
 'item_scraped_count': 19,
 'log_count/DEBUG': 40,
 'log_count/INFO': 28,
 'log_count/WARNING': 2,
 'memusage/max': 45748224,
 'memusage/startup': 32595968,
 'request_depth_max': 1,
 'response_received_count': 20,
 'scheduler/dequeued': 20,
 'scheduler/dequeued/memory': 20,
 'scheduler/enqueued': 20,
 'scheduler/enqueued/memory': 20,
 'start_time': datetime.datetime(2017, 9, 9, 7, 47, 1, 843551)}
2017-09-09 15:48:35 [scrapy.core.engine] INFO: Spider closed (finished)
Exception twisted._threads._ithreads.AlreadyQuit: AlreadyQuit() in <bound method JobstreetPipeline.__del__ of <jobstreet.pipelines.JobstreetPipeline object at 0x103c152d0>> ignored

Current Code:

class IT(CrawlSpider):
    name = 'IT'

    allowed_domains = ["www.jobstreet.com.sg"]
    start_urls = [
        'https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?ojs=10&key=it',
    ]

    rules = (
#       Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//*[@id="page_next"]',)), callback="self.parse", follow=True),
        Rule(LinkExtractor(allow_domains=("jobstreet.com.sg", ), restrict_xpaths=('//*[@id="page_next"]',)), callback='parse', follow=True),
    )

    def parse(self, response):

        items = []
            ...
            ...
            ...
            item = JobsItems()
            ...
            ...
            ...
            item['jobdetailsurl'] = sel.xpath('.//a[@class="position-title-link"]/@href').extract()[0]

            request = scrapy.Request(item['jobdetailsurl'], callback=self.parse_jobdetails)
            request.meta['item'] = item
            yield request

Upvotes: 0

Views: 280

Answers (2)

Tarun Lalwani

Reputation: 146510

There are a few issues. SgmlLinkExtractor is deprecated; you should use LinkExtractor instead.

Also, callback="self.parse" should be the function name as a string, without self. And because CrawlSpider uses parse internally, if you want to extract data from the response you should use a separately named callback:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class IT(CrawlSpider):
    name = 'IT'

    allowed_domains = ["www.jobstreet.com.sg"]
    start_urls = [
        'https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?ojs=10&key=it',
    ]

    rules = (
        Rule(LinkExtractor(allow_domains=("jobstreet.com.sg", ), restrict_xpaths=('//*[@id="page_next"]',)), callback='parse_response', follow=True),
    )

    def parse_response(self, response):
        yield {"page": response.url}

Edit-2

I also added response.body to the yielded item and found the following:

 <script src="https://www.google.com/recaptcha/api.js?hl="></script>\n\n    </head>\n\n\t<body>\n\n        <div id="main" class="main">\n            <div class="captchaDiv">\n\t\t\t\t<div id="headerDiv" class="headerDiv">\n\t\t\t\t\t<div id="insideHeaderDiv" class="insideHeaderDiv"></div>\n\t\t\t\t</div>\n\t\t\t\t<div class="separator"></div>\n\t\t\t\t<div id="mainDiv" class="mainDiv">\n\t\t\t\t\t<div id="titleText" class="titleText">Security Check</div>\n\t\t\t\t\t<div id="instructionsText" class="instructionsText">Before we allow your access to this page, we need to confirm if you are a human (it\'s a spam prevention thing)</div>\n\t\t\t\t\t<div class="g-recaptcha" data-sitekey=\'6LcX6A4UAAAAAKK1WiuMtXOj6Ib-lXZwVaWGvkq6\' data-callback=\'mprv_captcha_submitUserInput\'></div>\n\t\t\t\t\t<div id="footerText" class="footerText">SPAM Prevention</div>\n\t\t\t\t</div>\n            </div>\n        </div>\n\t</body>\n</html>'

So the page shows a captcha after a few requests, and hence the scraping stops there. You need to work around that by slowing down requests, solving the captcha somehow, or using a service like Crawlera.
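
For the slow-down option, here is a minimal sketch using standard Scrapy settings on the spider; the specific values are assumptions to illustrate the idea, not thresholds the site is known to accept:

class IT(CrawlSpider):
    name = 'IT'
    # Throttle the crawl so the site is less likely to serve the captcha page
    custom_settings = {
        'DOWNLOAD_DELAY': 2,                  # wait about 2 seconds between requests
        'AUTOTHROTTLE_ENABLED': True,         # let Scrapy adapt the delay to server responses
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,  # only one request in flight per domain
    }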

Edit-1

Output from the crawl

2017-09-09 13:10:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?key=it&area=1&option=1&job-source=1%2C64&classified=1&job-posted=0&sort=2&order=0&pg=16&src=16&srcr=16> (referer: https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?key=it&area=1&option=1&job-source=1%2C64&classified=1&job-posted=0&sort=2&order=0&pg=15&src=16&srcr=16)
2017-09-09 13:10:06 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?key=it&area=1&option=1&job-source=1%2C64&classified=1&job-posted=0&sort=2&order=0&pg=16&src=16&srcr=16>
{'page': 'https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?key=it&area=1&option=1&job-source=1%2C64&classified=1&job-posted=0&sort=2&order=0&pg=16&src=16&srcr=16'}
2017-09-09 13:10:06 [scrapy.core.engine] INFO: Closing spider (finished)
2017-09-09 13:10:06 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 15664,
 'downloader/request_count': 16,
 'downloader/request_method_count/GET': 16,
 'downloader/response_bytes': 306524,
 'downloader/response_count': 16,
 'downloader/response_status_count/200': 16,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 9, 9, 7, 40, 6, 294685),
 'item_scraped_count': 15,
 'log_count/DEBUG': 31,
 'log_count/INFO': 7,
 'memusage/max': 49135616,
 'memusage/startup': 49135616,
 'request_depth_max': 15,
 'response_received_count': 16,
 'scheduler/dequeued': 16,
 'scheduler/dequeued/memory': 16,
 'scheduler/enqueued': 16,
 'scheduler/enqueued/memory': 16,
 'start_time': datetime.datetime(2017, 9, 9, 7, 39, 56, 380899)}
2017-09-09 13:10:06 [scrapy.core.engine] INFO: Spider closed (finished)

Upvotes: 1

Anurag Misra

Reputation: 1544

You can use the following code for your pagination:

next_page = response.xpath('//*[@id="page_next"]/@href').extract_first()

if next_page:
    yield scrapy.Request(next_page, self.parse)
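
If the page_next href turns out to be relative rather than absolute, you may need to join it against the current URL with response.urljoin, as in the earlier snippet from the question:

next_page = response.xpath('//*[@id="page_next"]/@href').extract_first()

if next_page:
    # join a possibly relative href against the current page URL before requesting it
    yield scrapy.Request(response.urljoin(next_page), self.parse)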

Upvotes: 0
