Reputation: 239
I was wondering if there is a way to restart a scrapy crawler. This is what my code looks like:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess

results = set()

class SitemapCrawler(CrawlSpider):
    name = "Crawler"
    start_urls = ['http://www.example.com']
    allowed_domains = ['www.example.com']
    rules = [Rule(LinkExtractor(), callback='parse_links', follow=True)]

    def parse_links(self, response):
        # Collect the page URL and every outgoing link into the shared set
        href = response.xpath('//a/@href').getall()
        results.add(response.url)
        for link in href:
            results.add(link)

process = CrawlerProcess()

def start():
    process.crawl(SitemapCrawler)
    process.start()
    for link in results:
        print(link)
If I try calling start() twice, it runs once and then gives me this error:
raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
I know this is a general question, so I don't expect any code, but I just want to know how I can fix this issue. Thanks in advance.
Upvotes: 1
Views: 842
Reputation: 96
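Use CrawlerRunner instead of CrawlerProcess. With CrawlerRunner you create and start the Twisted reactor yourself, exactly once, and crawls are just scheduled on it, so nothing ever tries to restart it (MySpider below is a placeholder for your own spider):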
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    name = "my_spider"  # placeholder; your real spider definition goes here

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

d = runner.crawl(MySpider)

def finished(result):
    print("finished :D")
    reactor.stop()  # let reactor.run() return once the crawl is done

d.addBoth(finished)  # fires on success or failure
reactor.run()  # blocks here until finished() stops the reactor
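Since the reactor stays under your control, you can also chain several crawls on it; this is how you "restart" a crawl without hitting ReactorNotRestartable. Here is a minimal runnable sketch (the spider is a hypothetical stub; substitute your SitemapCrawler) that mirrors the Scrapy docs' pattern for running multiple crawls sequentially in one process:

from twisted.internet import reactor, defer
import scrapy
from scrapy.crawler import CrawlerRunner

class MySpider(scrapy.Spider):
    name = "my_spider"                       # hypothetical stub spider
    start_urls = ['http://www.example.com']  # placeholder URL

    def parse(self, response):
        self.logger.info("visited %s", response.url)

runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # Each yield waits for the previous crawl to finish; both crawls run
    # inside the single reactor, so ReactorNotRestartable never comes up.
    yield runner.crawl(MySpider)
    yield runner.crawl(MySpider)  # the "restart": a second, fresh crawl
    reactor.stop()

crawl()
reactor.run()  # the reactor is started exactly once

The key point is that reactor.run() appears once at the end; every crawl, however many you queue, happens before the reactor is stopped.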
Upvotes: 1