augustomen

Reputation: 9759

How to perform one final request in scrapy after all requests are done?

In the spider I'm building, I'm required to log in to the website before I can start performing requests (which is quite simple), and then I loop through to perform a few thousand requests.

However, on this particular website, if I don't log out, I get a 10-minute penalty before I can log in again. So I've tried to log out after the loop is done, with a lower priority, like this:

def parse_after_login(self, response):
    for item in [long_list]:
        yield scrapy.Request(..., callback=self.parse_result, priority=100)

    # After all requests have been made, perform logout:
    yield scrapy.Request('/logout/', callback=self.parse_logout, priority=0)

However, there is no guarantee that the logout request won't be ready before the other requests have finished processing, and a premature logout would invalidate them.

I have found no way of performing a new request from the spider_closed signal handler, since by then the engine is already shutting down.

How can I perform a new request after all other requests are completed?

Upvotes: 2

Views: 1901

Answers (1)

eLRuLL

Reputation: 18799

You can use the spider_idle signal, which is fired when the spider has stopped processing everything, to send one last request.

So first, connect a method to the spider_idle signal:

self.crawler.signals.connect(self.spider_idle, signal=signals.spider_idle)
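A natural place for that connect call is the spider's from_crawler class method, which is the standard Scrapy hook for wiring up signals (the crawler object is not yet available in __init__). A minimal sketch, with MySpider being the same example class used below:

from scrapy import Spider, signals

class MySpider(Spider):
    ...
    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # The crawler is available here, so this is a safe place to
        # register the idle handler
        crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
        return spider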

Now you can use the self.spider_idle method to run your final tasks once the spider has stopped processing everything:

class MySpider(Spider):
    ...
    logged_out = False  # flag so the logout request is only issued once

    ...
    def spider_idle(self, spider):
        # Called when there are no more pending requests to process
        if not self.logged_out:
            self.logged_out = True
            req = Request('someurl', callback=self.parse_logout)
            # Hand the request to the engine directly, since no callback
            # is yielding requests anymore at this point
            self.crawler.engine.crawl(req, spider)
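If the spider still shuts down before the logout response comes back, a variation of the same handler can raise DontCloseSpider (from scrapy.exceptions), which is the documented way to keep an idle spider open. A sketch, reusing the logged_out flag and the parse_logout callback assumed above (note that depending on your Scrapy version, engine.crawl() may accept only the request, without the spider argument):

from scrapy import Request, Spider
from scrapy.exceptions import DontCloseSpider

class MySpider(Spider):
    ...
    def spider_idle(self, spider):
        if not self.logged_out:
            self.logged_out = True
            req = Request('someurl', callback=self.parse_logout)
            self.crawler.engine.crawl(req, spider)
            # Keep the spider alive until the pending logout request
            # has been processed; on later idle events the flag is set,
            # no exception is raised, and the spider closes normally
            raise DontCloseSpider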

Upvotes: 6
