user3808579

Reputation: 41

Count scraped items from scrapy

Looking to just count the number of things scraped. I'm new to Python and scraping, just following the example, and want to know how to count the number of times Albert Einstein shows up and print it to a JSON file. I just can't get it to print to a file using print, yield, or return.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "author"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        i=0
        for quote in response.css('div.quote'):
            author = quote.css("small.author::text").get()
            if author == "Albert Einstein":
                i+=1


        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Upvotes: 1

Views: 1954

Answers (2)

user3808579

Reputation: 41

I found out how to get to the item_scraped_count that shows up in the log output at the end of the spider.

import scrapy
from scrapy import signals

class CountSpider(scrapy.Spider):
    name = 'count'                                                              
    start_urls = ['https://example.com']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(CountSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        stats = spider.crawler.stats.get_stats()
        numcount = str(stats['item_scraped_count'])
        # Here I can create a csv file with the stats
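For the question's original goal of a JSON file rather than a CSV, the count can be written out with the stdlib `json` module inside `spider_closed`. A minimal sketch, with a hand-made stats dict standing in for the real `spider.crawler.stats.get_stats()` result (the filename `item_count.json` is my own choice):

```python
import json

# Stand-in for the dict returned by spider.crawler.stats.get_stats();
# inside the real spider_closed this key is filled in by Scrapy.
stats = {'item_scraped_count': 10}

count = stats.get('item_scraped_count', 0)  # default to 0 if the key is absent
with open('item_count.json', 'w') as f:
    json.dump({'item_scraped_count': count}, f)
```

Using `.get` with a default guards against the key being missing, which happens when the crawl scraped no items at all.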

Upvotes: 2

renatodvc

Reputation: 2564

In Scrapy, requests are made asynchronously, and each request will call back to the parse function independently. Your i variable is not an instance variable, so its scope is limited to each function call.

Even if that weren't the case, the recursive callbacks would reset your counter to 0 each time, since parse sets i = 0 on every call.

I would suggest you take a look at Scrapy items; at the end of the Scrapy process it will report a counter with the number of scraped items. That may be overkill, though, if you don't want to store any more information than the number of occurrences of "Albert Einstein".

If that's all you want, you can use a simpler (if dirtier) solution: make your counter an instance variable and have the parse method increment it, like this:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "author"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]
    counter = 0

    def parse(self, response):
        for quote in response.css('div.quote'):
            author = quote.css("small.author::text").get()
            if author == "Albert Einstein":
                self.counter += 1


        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
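To then get the counter into a JSON file as the question asks, Scrapy calls a `closed(self, reason)` method on the spider when the crawl finishes, so the final value can be dumped there. A stdlib-only sketch of that pattern (the class is a stand-in for the spider above, and the filename and key name are my own choices):

```python
import json

class CounterSketch:
    # Stand-in for the spider above: counter is a class attribute that
    # each parse callback increments via self.counter.
    counter = 0

    def count_author(self, author):
        # Mirrors the check inside parse().
        if author == "Albert Einstein":
            self.counter += 1

    def closed(self, reason):
        # In a real spider, Scrapy calls closed(reason) automatically
        # at the end of the crawl; write the final count out here.
        with open('einstein_count.json', 'w') as f:
            json.dump({'albert_einstein_count': self.counter}, f)
```

In the real spider, only the `closed` method needs to be added; Scrapy invokes it with the close reason (e.g. 'finished') once all requests are done.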

Upvotes: 0
