Zx3s

Reputation: 77

Can't get Scrapy Stats from scrapy.CrawlerProcess

I'm running Scrapy spiders from another script, and I need to retrieve the crawler's stats and save them to a variable. I've looked into the docs and other StackOverflow questions, but I haven't been able to solve this issue.

This is my script from which I'm running crawling:

import scrapy
from scrapy.crawler import CrawlerProcess

import spiders  # my module containing MySpider


process = CrawlerProcess({})
process.crawl(spiders.MySpider)
process.start()

stats = CrawlerProcess.stats.get_stats() # I need something like this

I would like stats to contain this piece of data (scrapy.statscollectors):

     {'downloader/request_bytes': 44216,
      'downloader/request_count': 36,
      'downloader/request_method_count/GET': 36,
      'downloader/response_bytes': 1061929,
      'downloader/response_count': 36,
      'downloader/response_status_count/200': 36,
      'finish_reason': 'finished',
      'finish_time': datetime.datetime(2018, 11, 9, 16, 31, 2, 382546),
      'log_count/DEBUG': 37,
      'log_count/ERROR': 35,
      'log_count/INFO': 9,
      'memusage/max': 62623744,
      'memusage/startup': 62623744,
      'request_depth_max': 1,
      'response_received_count': 36,
      'scheduler/dequeued': 36,
      'scheduler/dequeued/memory': 36,
      'scheduler/enqueued': 36,
      'scheduler/enqueued/memory': 36,
      'start_time': datetime.datetime(2018, 11, 9, 16, 30, 38, 140469)}

I've inspected CrawlerProcess: its crawl() method returns a Deferred, and finished crawlers are removed from the process's 'crawlers' attribute once scraping is done, so I can't read the stats from there afterwards.

Is there a way to solve this?

Best, Peter

Upvotes: 5

Views: 1657

Answers (2)

sid10on10

Reputation: 1

If you want to get the stats in the script via signals, this will help:

from pydispatch import dispatcher  # PyDispatcher package
from scrapy import signals
from scrapy.crawler import CrawlerProcess


def spider_results(spider):
    results = []
    stats = []

    def crawler_results(signal, sender, item, response, spider):
        # collect every scraped item
        results.append(item)

    def crawler_stats(*args, **kwargs):
        # runs when the spider closes; 'sender' is the crawler
        stats.append(kwargs['sender'].stats.get_stats())

    dispatcher.connect(crawler_results, signal=signals.item_scraped)

    dispatcher.connect(crawler_stats, signal=signals.spider_closed)

    process = CrawlerProcess()
    process.crawl(spider)
    process.start()  # the script will block here until the crawling is finished
    return results, stats
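
For reference, a minimal usage sketch (MySpider here stands in for your own spider class):

items, run_stats = spider_results(MySpider)
print(run_stats[0].get('finish_reason'))  # e.g. 'finished'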

I hope it helps!

Upvotes: 0

starrify

Reputation: 14731

According to the documentation, CrawlerProcess.crawl accepts either a crawler or a spider class, and you're able to create a crawler from the spider class via CrawlerProcess.create_crawler.

Thus you may create the crawler instance before starting the crawl process, and retrieve the expected attributes after that.

Below is an example, made by editing a few lines of your original code:

import scrapy
from scrapy.crawler import CrawlerProcess


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['http://httpbin.org/get']

    def parse(self, response):
        self.crawler.stats.inc_value('foo')


process = CrawlerProcess({})
crawler = process.create_crawler(TestSpider)  # keep a reference to the crawler
process.crawl(crawler)
process.start()

stats_obj = crawler.stats  # the StatsCollector instance
stats_dict = crawler.stats.get_stats()  # the plain dict shown in the question
# perform the actions you want with the stats object or dict
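
For example, you can read individual values afterwards (the 'foo' key is the one incremented in parse() above):

print(stats_dict.get('foo'))                 # 1
print(stats_obj.get_value('finish_reason'))  # 'finished'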

Upvotes: 6
