Reputation: 841
I am trying to make Scrapy traverse to the next page and continue crawling, but it simply stops once the crawler reaches the end of the page. Here is a snippet of my code:
class IT(CrawlSpider):
    name = 'IT'
    allowed_domains = ["www.jobstreet.com.sg"]
    start_urls = [
        'https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?ojs=10&key=it',
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//*[@id="page_next"]',)), callback="self.parse", follow=True),
    )
Any idea how to make it work? Previously I used the snippet below and the traversal worked, but it always stopped at page 7:
next_page = response.xpath('//*[(@id = "page_next")]/@href')
if next_page:
    url = response.urljoin(next_page[0].extract())
    yield scrapy.Request(url, self.parse)
EDIT:
2017-09-09 15:48:35 [scrapy.core.engine] INFO: Closing spider (finished)
2017-09-09 15:48:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 16626,
'downloader/request_count': 20,
'downloader/request_method_count/GET': 20,
'downloader/response_bytes': 197475,
'downloader/response_count': 20,
'downloader/response_status_count/200': 20,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 9, 9, 7, 48, 35, 66598),
'item_scraped_count': 19,
'log_count/DEBUG': 40,
'log_count/INFO': 28,
'log_count/WARNING': 2,
'memusage/max': 45748224,
'memusage/startup': 32595968,
'request_depth_max': 1,
'response_received_count': 20,
'scheduler/dequeued': 20,
'scheduler/dequeued/memory': 20,
'scheduler/enqueued': 20,
'scheduler/enqueued/memory': 20,
'start_time': datetime.datetime(2017, 9, 9, 7, 47, 1, 843551)}
2017-09-09 15:48:35 [scrapy.core.engine] INFO: Spider closed (finished)
Exception twisted._threads._ithreads.AlreadyQuit: AlreadyQuit() in <bound method JobstreetPipeline.__del__ of <jobstreet.pipelines.JobstreetPipeline object at 0x103c152d0>> ignored
Current Code:
class IT(CrawlSpider):
    name = 'IT'
    allowed_domains = ["www.jobstreet.com.sg"]
    start_urls = [
        'https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?ojs=10&key=it',
    ]

    rules = (
        # Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//*[@id="page_next"]',)), callback="self.parse", follow=True),
        Rule(LinkExtractor(allow_domains=("jobstreet.com.sg", ), restrict_xpaths=('//*[@id="page_next"]',)), callback='parse', follow=True),
    )

    def parse(self, response):
        items = []
        ...
        item = JobsItems()
        ...
        item['jobdetailsurl'] = sel.xpath('.//a[@class="position-title-link"]/@href').extract()[0]
        request = scrapy.Request(item['jobdetailsurl'], callback=self.parse_jobdetails)
        request.meta['item'] = item
        yield request
Upvotes: 0
Views: 280
Reputation: 146510
A few issues. SgmlLinkExtractor is deprecated; you should use LinkExtractor instead. Also, callback="self.parse" should be just the function name as a string, without self. Finally, CrawlSpider uses the parse method internally to apply its rules, so if you want to extract data from the response you should use a separately named callback:
class IT(CrawlSpider):
    name = 'IT'
    allowed_domains = ["www.jobstreet.com.sg"]
    start_urls = [
        'https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?ojs=10&key=it',
    ]

    rules = (
        Rule(LinkExtractor(allow_domains=("jobstreet.com.sg", ), restrict_xpaths=('//*[@id="page_next"]',)), callback='parse_response', follow=True),
    )

    def parse_response(self, response):
        yield {"page": response.url}
Edit-2
I added response.body to the yielded item as well, and found the following:
<script src="https://www.google.com/recaptcha/api.js?hl="></script>\n\n </head>\n\n\t<body>\n\n <div id="main" class="main">\n <div class="captchaDiv">\n\t\t\t\t<div id="headerDiv" class="headerDiv">\n\t\t\t\t\t<div id="insideHeaderDiv" class="insideHeaderDiv"></div>\n\t\t\t\t</div>\n\t\t\t\t<div class="separator"></div>\n\t\t\t\t<div id="mainDiv" class="mainDiv">\n\t\t\t\t\t<div id="titleText" class="titleText">Security Check</div>\n\t\t\t\t\t<div id="instructionsText" class="instructionsText">Before we allow your access to this page, we need to confirm if you are a human (it\'s a spam prevention thing)</div>\n\t\t\t\t\t<div class="g-recaptcha" data-sitekey=\'6LcX6A4UAAAAAKK1WiuMtXOj6Ib-lXZwVaWGvkq6\' data-callback=\'mprv_captcha_submitUserInput\'></div>\n\t\t\t\t\t<div id="footerText" class="footerText">SPAM Prevention</div>\n\t\t\t\t</div>\n </div>\n </div>\n\t</body>\n</html>'
So the page shows a captcha after a few requests, and that is where the scraping stops. You need to work around that by slowing down your requests, solving the captcha somehow, or using a proxy service like Crawlera.
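If you want to try the "slow down requests" route, a minimal sketch is to throttle the spider through its custom_settings; the numbers below are assumptions to tune, not values verified against this site:

class IT(CrawlSpider):
    name = 'IT'
    # allowed_domains, start_urls and rules as above ...

    # Throttle the crawl so the captcha is less likely to trigger.
    # These values are guesses; adjust them for the target site.
    custom_settings = {
        'DOWNLOAD_DELAY': 5,                  # wait ~5s between requests
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,  # one request at a time
        'AUTOTHROTTLE_ENABLED': True,         # back off automatically under load
        'AUTOTHROTTLE_START_DELAY': 5,
        'AUTOTHROTTLE_MAX_DELAY': 60,
    }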
Edit-1
Output from the crawl
2017-09-09 13:10:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?key=it&area=1&option=1&job-source=1%2C64&classified=1&job-posted=0&sort=2&order=0&pg=16&src=16&srcr=16> (referer: https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?key=it&area=1&option=1&job-source=1%2C64&classified=1&job-posted=0&sort=2&order=0&pg=15&src=16&srcr=16)
2017-09-09 13:10:06 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?key=it&area=1&option=1&job-source=1%2C64&classified=1&job-posted=0&sort=2&order=0&pg=16&src=16&srcr=16>
{'page': 'https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?key=it&area=1&option=1&job-source=1%2C64&classified=1&job-posted=0&sort=2&order=0&pg=16&src=16&srcr=16'}
2017-09-09 13:10:06 [scrapy.core.engine] INFO: Closing spider (finished)
2017-09-09 13:10:06 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 15664,
'downloader/request_count': 16,
'downloader/request_method_count/GET': 16,
'downloader/response_bytes': 306524,
'downloader/response_count': 16,
'downloader/response_status_count/200': 16,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 9, 9, 7, 40, 6, 294685),
'item_scraped_count': 15,
'log_count/DEBUG': 31,
'log_count/INFO': 7,
'memusage/max': 49135616,
'memusage/startup': 49135616,
'request_depth_max': 15,
'response_received_count': 16,
'scheduler/dequeued': 16,
'scheduler/dequeued/memory': 16,
'scheduler/enqueued': 16,
'scheduler/enqueued/memory': 16,
'start_time': datetime.datetime(2017, 9, 9, 7, 39, 56, 380899)}
2017-09-09 13:10:06 [scrapy.core.engine] INFO: Spider closed (finished)
Upvotes: 1
Reputation: 1544
You can use the following code for your pagination:
next_page = response.xpath('//*[@id="page_next"]/@href').extract_first()
if next_page:
    yield scrapy.Request(response.urljoin(next_page), self.parse)
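For context, here is a minimal self-contained sketch of that pattern in a plain scrapy.Spider; the spider name and the yielded fields are placeholders, not the asker's real ones:

import scrapy

class JobListingSpider(scrapy.Spider):
    name = 'job_listings'  # placeholder name
    allowed_domains = ['www.jobstreet.com.sg']
    start_urls = [
        'https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?ojs=10&key=it',
    ]

    def parse(self, response):
        # extract whatever you need from the current listing page
        yield {'page': response.url}

        # follow the "next page" link until it no longer exists
        next_page = response.xpath('//*[@id="page_next"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)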
Upvotes: 0