Reputation: 11
I am trying to extract all URLs belonging to my test domain. The site is JavaScript-rendered, so it needs Selenium to crawl through all the URLs on the domain. However, the crawler stops after crawling a single page, and I need to collect every URL associated with my domain.
I used the scrapy_selenium module for this, and the code I used is below:
import scrapy
from scrapy_selenium import SeleniumRequest
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class MySpider(CrawlSpider):
    name = 'example'
    start_urls = ['https://www.example.com/']

    rules = (
        Rule(LinkExtractor(allow_domains=['example.com']), follow=True),
    )

    def start_requests(self):
        for url in self.start_urls:
            print("+++++++++++++++++++++++++++++++++++++++++++++++++++++", url)
            yield SeleniumRequest(url=url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        print(response.url)
        item = {'url': response.url, 'html': response.body}
        yield item

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_selenium.SeleniumMiddleware': 800
        },
        'SELENIUM_DRIVER_NAME': 'chrome',
        'SELENIUM_DRIVER_EXECUTABLE_PATH': '/home/ubuntu/selenium_drivers/chromedriver',  # path to the chromedriver executable
        'SELENIUM_DRIVER_ARGUMENTS': ['-headless']  # '-headless' runs Chrome in headless mode
    }
I don't understand why the crawler stops after one page instead of crawling through the rest of the pages.
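For comparison, here is a minimal plain CrawlSpider sketch (no Selenium) following the pattern from the Scrapy docs, where the rules point to a separately named callback and parse() itself is not overridden. PlainSpider and parse_item are just illustrative names, not part of my actual code:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PlainSpider(CrawlSpider):
    name = 'plain_example'
    start_urls = ['https://www.example.com/']

    # The rule extracts in-domain links, follows them, and hands each
    # downloaded page to parse_item.
    rules = (
        Rule(LinkExtractor(allow_domains=['example.com']),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # One item per crawled page
        yield {'url': response.url, 'html': response.body}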
Upvotes: 0
Views: 48