Hellohowdododo

Reputation: 406

How to get a stats value after CrawlerProcess has finished, i.e. on the line after process.start()

I am using this code somewhere inside my spider:

raise scrapy.exceptions.CloseSpider('you_need_to_rerun')

So, when this exception is raised, my spider eventually finishes closing, and in the console I get stats containing this string:

'finish_reason': 'you_need_to_rerun',

But how can I get it from code? I want to run the spider again in a loop, based on info from these stats, something like this:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import spaida.spiders.spaida_spider
import spaida.settings


you_need_to_rerun = True
while you_need_to_rerun:
    process = CrawlerProcess(get_project_settings())
    process.crawl(spaida.spiders.spaida_spider.SpaidaSpiderSpider)
    process.start(stop_after_crawl=False)  # the script will block here until the crawling is finished
    finish_reason = 'and here I somehow get finish_reason from the stats'  # <- how??
    if finish_reason == 'finished':
        print("everything ok, I don't need to rerun this")
        you_need_to_rerun = False

I found this in the docs, but I can't get it to work. Where is that "The stats can be accessed through the spider_stats attribute, which is a dict keyed by spider domain name."? https://doc.scrapy.org/en/latest/topics/stats.html#scrapy.statscollectors.MemoryStatsCollector.spider_stats

P.S.: I'm also getting the error twisted.internet.error.ReactorNotRestartable when using process.start(), along with recommendations to use process.start(stop_after_crawl=False) - but then the spider just stops and does nothing. That is another problem, though...

Upvotes: 1

Views: 768

Answers (1)

Granitosaurus

Reputation: 21406

You need to access the stats object via the Crawler object:

process = CrawlerProcess(get_project_settings())
process.crawl(spaida.spiders.spaida_spider.SpaidaSpiderSpider)
crawler = list(process.crawlers)[0]  # crawlers is a set, so it cannot be indexed directly
process.start()
reason = crawler.stats.get_value('finish_reason')
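
Plugged into the check from the question, that looks roughly like this (a sketch; note that the surrounding while loop from the question still hits the ReactorNotRestartable limitation, since the Twisted reactor can only be started once per process):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import spaida.spiders.spaida_spider

process = CrawlerProcess(get_project_settings())
process.crawl(spaida.spiders.spaida_spider.SpaidaSpiderSpider)
crawler = list(process.crawlers)[0]  # keep a reference to the Crawler before starting the reactor
process.start()  # blocks here until the crawling is finished

finish_reason = crawler.stats.get_value('finish_reason')
if finish_reason == 'finished':
    print("everything ok, I don't need to rerun this")

If you need the whole stats dictionary rather than a single value, crawler.stats.get_stats() returns it.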

Upvotes: 0
