Reputation: 11
I am trying to extract all URLs belonging to my test domain. The site is JavaScript-rendered, so it needs Selenium to crawl through all the URLs on the domain. However, the crawler stops after crawling a single page, and I need to collect every URL associated with my domain.
I used the scrapy_selenium module for this, and the code I used is below:
import scrapy
from scrapy_selenium import SeleniumRequest
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class MySpider(CrawlSpider):
    name = 'example'
    start_urls = ['https://www.example.com/']

    rules = (
        Rule(LinkExtractor(allow_domains=['example.com']), follow=True),
    )

    def start_requests(self):
        for url in self.start_urls:
            print("+++++++++++++++++++++++++++++++++++++++++++++++++++++", url)
            yield SeleniumRequest(url=url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        print(response.url)
        item = {'url': response.url, 'html': response.body}
        yield item

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_selenium.SeleniumMiddleware': 800
        },
        'SELENIUM_DRIVER_NAME': 'chrome',
        'SELENIUM_DRIVER_EXECUTABLE_PATH': '/home/ubuntu/selenium_drivers/chromedriver',  # path to the chromedriver executable
        'SELENIUM_DRIVER_ARGUMENTS': ['-headless']  # '-headless' runs Chrome in headless mode
    }
I don't understand why the crawler stops after one page instead of crawling through the rest of the pages.
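For comparison, here is a minimal plain CrawlSpider sketch (no Selenium) following the pattern from the Scrapy docs, where the rules point to a separately named callback and parse() itself is not overridden. PlainSpider and parse_item are just illustrative names, not part of my actual code:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PlainSpider(CrawlSpider):
    name = 'plain_example'
    start_urls = ['https://www.example.com/']

    # The rule extracts in-domain links, follows them, and hands each
    # downloaded page to parse_item.
    rules = (
        Rule(LinkExtractor(allow_domains=['example.com']),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # One item per crawled page
        yield {'url': response.url, 'html': response.body}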
Upvotes: 0
Views: 48