Reputation: 25606
Let's say I have a crawl spider similar to this example:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.item import Item, Field

    class MyItem(Item):
        # Fields must be declared before they can be assigned on an item;
        # a bare Item() would raise KeyError on item['id'] = ...
        id = Field()
        name = Field()
        description = Field()

    class MySpider(CrawlSpider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com']

        rules = (
            # Extract links matching 'category.php' (but not matching 'subsection.php')
            # and follow links from them (since no callback means follow=True by default).
            Rule(SgmlLinkExtractor(allow=(r'category\.php',), deny=(r'subsection\.php',))),

            # Extract links matching 'item.php' and parse them with the spider's method parse_item
            Rule(SgmlLinkExtractor(allow=(r'item\.php',)), callback='parse_item'),
        )

        def parse_item(self, response):
            self.log('Hi, this is an item page! %s' % response.url)
            hxs = HtmlXPathSelector(response)
            item = MyItem()
            item['id'] = hxs.select('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
            item['name'] = hxs.select('//td[@id="item_name"]/text()').extract()
            item['description'] = hxs.select('//td[@id="item_description"]/text()').extract()
            return item
Let's say I want to compute some aggregate information, such as the sum of the item IDs, or the average number of characters in the descriptions across all of the parsed pages. How would I do that?
Also, how could I get averages for a particular category?
Upvotes: 0
Views: 1826
Reputation: 62813
You could use Scrapy's stats collector to accumulate this kind of information as you go, or at least to gather the raw data you need to compute it afterwards. For per-category stats, use a separate stats key per category.
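For illustration, here is a minimal sketch of what that could look like in parse_item, assuming a Scrapy version where the spider can reach the stats collector through self.crawler.stats (the key names my/ids_sum, my/desc_chars, and my/item_count are arbitrary choices, not anything Scrapy defines):

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = MyItem()
        item['id'] = hxs.select('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['description'] = hxs.select('//td[@id="item_description"]/text()').extract()

        stats = self.crawler.stats
        if item['id']:
            # Running sum of all item IDs seen so far
            stats.inc_value('my/ids_sum', count=int(item['id'][0]))
        if item['description']:
            # Total description length plus an item count, so the average
            # can be computed later as my/desc_chars / my/item_count
            stats.inc_value('my/desc_chars', count=len(item['description'][0]))
            stats.inc_value('my/item_count')
        return item

For a per-category average, the same idea works with one key per category, e.g. stats.inc_value('my/desc_chars/%s' % category), where category is whatever value you extract to identify the page (a hypothetical variable here).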
For a quick dump of all stats gathered during a crawl, you can add STATS_DUMP = True to your settings.py.
Redis (via redis-py) is also a great option for stats collection.
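Here is a minimal sketch of that idea with redis-py, assuming a Redis server running locally on the default port and again using arbitrary key names (ids_sum, desc_chars, item_count):

    import redis

    # Connect to a local Redis server (assumed to be running)
    r = redis.StrictRedis(host='localhost', port=6379, db=0)

    def record_item(item_id, description):
        # Atomic counters keep running totals, even with several
        # spider processes writing at the same time
        r.incrby('ids_sum', int(item_id))
        r.incrby('desc_chars', len(description))
        r.incr('item_count')

    def average_description_length():
        total = int(r.get('desc_chars') or 0)
        count = int(r.get('item_count') or 0)
        return float(total) / count if count else 0.0

You would call record_item from your parse_item callback; since the counters live outside the crawl process, the averages can be read at any time, even while the crawl is still running.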
Upvotes: 3