Reputation: 346
The problem is quite simple: a spider logs in to a website, crawls some data, and then quits. The required behavior is to log in, crawl the data, and then log out.
Hard-coding this is not an option, since there are about 60 spiders, all inheriting from a BaseSpider.
I've tried using signals: I connect a handler to the spider_idle
signal that simply sends a request to a logout URL each spider has to provide. I couldn't get it to work, though; the logout callback is never called, and I haven't been able to figure out why.
Here is the code:
from scrapy import signals
from scrapy.http import Request

@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super(BaseSpider, cls).from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
    return spider

def spider_idle(self, spider):
    # Schedule one final request to the logout URL once the queue is empty
    if not self.logged_out:
        self.crawler.engine.crawl(
            Request(self.logout_url, callback=self.logout), spider)

def logout(self, response):
    self.logged_out = True
I don't see why this wouldn't work. As I understand it, the spider_idle
signal is fired when there are no more requests in the queue, i.e. when the spider is done.
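For reference, a spider_idle handler can also raise Scrapy's DontCloseSpider exception so the engine stays open while the freshly scheduled request is processed. A minimal sketch of that pattern, reusing the logout_url and logged_out attributes from above and mirroring the engine.crawl signature used in the snippet:

from scrapy import signals
from scrapy.exceptions import DontCloseSpider
from scrapy.http import Request

def spider_idle(self, spider):
    # Fired by the engine when the scheduler has no more requests left.
    if not self.logged_out:
        self.crawler.engine.crawl(
            Request(self.logout_url, callback=self.logout, dont_filter=True),
            spider)
        # Keep the spider alive until the scheduled logout request has run
        raise DontCloseSpider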
Upvotes: 1
Views: 839
Reputation: 21201
I have been using Scrapy for many years and ended up in a scenario like yours.
The only solution that achieved the goal was to use Python's requests library inside the spider_closed handler; spider_idle
and the other signals didn't help.
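A minimal sketch of that approach, assuming each spider still exposes a logout_url and stashes its session cookies in a hypothetical session_cookies dict when it logs in (a blocking GET is fine here because the crawl is already finished):

import requests
from scrapy import signals

@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super(BaseSpider, cls).from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
    return spider

def spider_closed(self, spider):
    # Runs after the crawl has finished, so a synchronous call is acceptable.
    # session_cookies is an assumed attribute filled in by the login callback.
    requests.get(self.logout_url, cookies=getattr(self, 'session_cookies', {}))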
Upvotes: 2