Christian Adib

Reputation: 121

How to make scrapy spider crawl again if condition is not met?

In my close function, I am checking for the presence of a document scraped today, and I'd like to tell my Spider to scrape again if no such document is found. Basically, I need a robust way for the scraper to keep calling its crawl routine until a certain condition is met or MAX_RETRIES has been exhausted.
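
For reference, the check looks roughly like the sketch below; the spider name, selectors, and the scraped_dates bookkeeping are placeholders, not the real project code.

import datetime
import scrapy

class DocumentSpider(scrapy.Spider):
    name = "documents"                      # placeholder name
    start_urls = ["https://example.com"]    # placeholder URL

    def parse(self, response):
        # remember the dates of the documents scraped (made-up selector)
        for date in response.css(".document .date::text").getall():
            self.scraped_dates = getattr(self, "scraped_dates", []) + [date]
            yield {"date": date}

    def closed(self, reason):
        # called when the spider finishes: was a document dated today scraped?
        today = datetime.date.today().isoformat()
        if today not in getattr(self, "scraped_dates", []):
            self.logger.info("No document scraped today; need to crawl again")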

Upvotes: 0

Views: 571

Answers (1)

msenior_

Reputation: 2110

To re-run the spider after it has finished, you will need to use the Twisted reactor and the CrawlerRunner class. The crawl method returns a Deferred that fires once the spider has finished scraping, and you can add a callback to it in which you do your checks. See the example below, where the spider reruns until the number of retries reaches 3, at which point the reactor is stopped.

You will need to be careful about how you do your checks, because this is asynchronous code and the order of execution might not be what you expect.

import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']

    def parse(self, response):
        yield {
            "url": response.url
        }

if __name__ == '__main__':
    RETRIES = 0
    configure_logging()
    runner = CrawlerRunner()
    d = runner.crawl(ExampleSpider)
    def finished():
        global RETRIES
        # do your checks in this callback and run the spider again if needed;
        # in this example, we simply rerun while the number of retries
        # is below the limit
        if RETRIES < 3:
            RETRIES += 1
            d = runner.crawl(ExampleSpider)
            d.addBoth(lambda _: finished())
        else:
            reactor.stop() # stop the reactor once the retries are exhausted

    d.addBoth(lambda _: finished())
    reactor.run()
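
Applied to the question, the retry-counter check inside finished() would be replaced by the actual condition plus a retry cap. A minimal sketch of that variant, where found_document_for_today() is a hypothetical helper standing in for whatever the close handler checks (it reuses runner, reactor and ExampleSpider from the example above):

MAX_RETRIES = 3
RETRIES = 0

def found_document_for_today():
    # hypothetical placeholder: query your storage for a document dated today
    return False

def finished():
    global RETRIES
    # stop as soon as the condition is met, or once retries are exhausted
    if not found_document_for_today() and RETRIES < MAX_RETRIES:
        RETRIES += 1
        d = runner.crawl(ExampleSpider)
        d.addBoth(lambda _: finished())
    else:
        reactor.stop()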

Upvotes: 1
