Reputation: 25606
Let's say I have a crawl spider similar to this example:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.item import Item, Field

    class MyItem(Item):
        # Fields must be declared before they can be assigned on an item;
        # a bare Item() would raise KeyError on item['id'] = ...
        id = Field()
        name = Field()
        description = Field()

    class MySpider(CrawlSpider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com']

        rules = (
            # Extract links matching 'category.php' (but not matching 'subsection.php')
            # and follow links from them (since no callback means follow=True by default).
            Rule(SgmlLinkExtractor(allow=(r'category\.php',), deny=(r'subsection\.php',))),

            # Extract links matching 'item.php' and parse them with the spider's method parse_item
            Rule(SgmlLinkExtractor(allow=(r'item\.php',)), callback='parse_item'),
        )

        def parse_item(self, response):
            self.log('Hi, this is an item page! %s' % response.url)
            hxs = HtmlXPathSelector(response)
            item = MyItem()
            item['id'] = hxs.select('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
            item['name'] = hxs.select('//td[@id="item_name"]/text()').extract()
            item['description'] = hxs.select('//td[@id="item_description"]/text()').extract()
            return item
Let's say I want to compute some aggregate information, such as the sum of the item IDs, or the average number of characters in the descriptions across all of the parsed pages. How would I do that?
Also, how could I get averages for a particular category?
Upvotes: 0
Views: 1826
Reputation: 62813
You could use Scrapy's stats collector to accumulate this kind of information as you go, or at least to gather the raw data you need to compute it afterwards. For per-category stats, use a separate stats key per category.
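For illustration, here is a minimal sketch of what that could look like in parse_item, assuming a Scrapy version where the spider can reach the stats collector through self.crawler.stats (the key names my/ids_sum, my/desc_chars, and my/item_count are arbitrary choices, not anything Scrapy defines):

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = MyItem()
        item['id'] = hxs.select('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['description'] = hxs.select('//td[@id="item_description"]/text()').extract()

        stats = self.crawler.stats
        if item['id']:
            # Running sum of all item IDs seen so far
            stats.inc_value('my/ids_sum', count=int(item['id'][0]))
        if item['description']:
            # Total description length plus an item count, so the average
            # can be computed later as my/desc_chars / my/item_count
            stats.inc_value('my/desc_chars', count=len(item['description'][0]))
            stats.inc_value('my/item_count')
        return item

For a per-category average, the same idea works with one key per category, e.g. stats.inc_value('my/desc_chars/%s' % category), where category is whatever value you extract to identify the page (a hypothetical variable here).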
For a quick dump of all stats gathered during a crawl, you can add STATS_DUMP = True to your settings.py.
Redis (via redis-py) is also a great option for stats collection.
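Here is a minimal sketch of that idea with redis-py, assuming a Redis server running locally on the default port and again using arbitrary key names (ids_sum, desc_chars, item_count):

    import redis

    # Connect to a local Redis server (assumed to be running)
    r = redis.StrictRedis(host='localhost', port=6379, db=0)

    def record_item(item_id, description):
        # Atomic counters keep running totals, even with several
        # spider processes writing at the same time
        r.incrby('ids_sum', int(item_id))
        r.incrby('desc_chars', len(description))
        r.incr('item_count')

    def average_description_length():
        total = int(r.get('desc_chars') or 0)
        count = int(r.get('item_count') or 0)
        return float(total) / count if count else 0.0

You would call record_item from your parse_item callback; since the counters live outside the crawl process, the averages can be read at any time, even while the crawl is still running.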
Upvotes: 3