Reputation: 121
In my close function, I am checking for the presence of a document scraped today, and I'd like to tell my Spider to scrape again if no such document is found. Basically, I need a robust way for the scraper to keep calling its crawl routine until a certain condition is met or MAX_RETRIES has been exhausted.
Upvotes: 0
Views: 571
Reputation: 2110
To run the spider again after it has finished, you will need to use the reactor and the CrawlerRunner class. The crawl method returns a deferred once the spider has finished scraping, which you can use to add a callback where you do your checks. See the example below, where the spider reruns until the number of retries reaches 3, at which point it stops.
You will need to be careful about how you do your checks, because this is asynchronous code and the order in which statements execute may not be what you expect.
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']

    def parse(self, response):
        yield {
            "url": response.url
        }


if __name__ == '__main__':
    RETRIES = 0
    configure_logging()
    runner = CrawlerRunner()
    d = runner.crawl(ExampleSpider)

    def finished():
        global RETRIES
        # do your checks in this callback and run the spider again if needed
        # in this example, we check if the number of retries is less than the required value
        # if not, we stop the reactor
        if RETRIES < 3:
            RETRIES += 1
            d = runner.crawl(ExampleSpider)
            d.addBoth(lambda _: finished())
        else:
            reactor.stop()  # stop the reactor if the condition is not met

    d.addBoth(lambda _: finished())
    reactor.run()
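To tie this back to the original question, the check inside the callback can be replaced with whatever signals a successful scrape. The following is a minimal sketch, not a definitive implementation: it assumes the spider writes its items to a JSON lines feed via Scrapy's FEEDS setting, and the scraped_today() helper, the items.jl file name and the MAX_RETRIES value are hypothetical stand-ins for your own condition. It reuses the ExampleSpider class from the example above.

import os

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

MAX_RETRIES = 3          # illustrative retry limit
FEED_FILE = 'items.jl'   # hypothetical feed file the spider writes to


def scraped_today(path):
    # hypothetical check: treat a non-empty feed file as a successful scrape
    return os.path.exists(path) and os.path.getsize(path) > 0


if __name__ == '__main__':
    RETRIES = 0
    configure_logging()
    # FEEDS is a standard Scrapy setting; items are exported as JSON lines
    runner = CrawlerRunner(settings={'FEEDS': {FEED_FILE: {'format': 'jsonlines'}}})

    def finished(_):
        global RETRIES
        if scraped_today(FEED_FILE) or RETRIES >= MAX_RETRIES:
            reactor.stop()  # success, or retries exhausted
        else:
            RETRIES += 1
            # schedule another run and re-attach this callback
            runner.crawl(ExampleSpider).addBoth(finished)

    # ExampleSpider is the spider class defined in the example above
    runner.crawl(ExampleSpider).addBoth(finished)
    reactor.run()

Doing the check inside the callback, rather than after reactor.run(), matters because reactor.run() blocks until reactor.stop() is called; this is the asynchronous ordering mentioned above.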
Upvotes: 1