pielgrzym

Reputation: 1677

How to access all scraped items in Scrapy item pipeline?

I have an item with a rank field that has to be built by analyzing other items of the same class. I don't want to use a database or other backend to store them; I just need to access all currently scraped items and do some itertools magic on them. How can I do this after the spider finishes but before the data is exported (so the rank field won't be empty)?

Upvotes: 2

Views: 2720

Answers (3)

platelminto

Reputation: 358

You can collect all scraped items using Extensions and Signals.

from scrapy import signals


class ItemCollectorExtension:
    def __init__(self):
        self.items = []

    @classmethod
    def from_crawler(cls, crawler):
        extension = cls()

        crawler.signals.connect(extension.add_item, signal=signals.item_scraped)
        crawler.signals.connect(extension.spider_closed, signal=signals.spider_closed)

        return extension

    def spider_closed(self):
        print(self.items)  # Replace with your code

    def add_item(self, item):
        self.items.append(item)

Now, every time a new item is successfully scraped, it is added to self.items. When all items have been collected and the spider is closing, the spider_closed method is called. Here, you can access all the collected items.

Don't forget to enable the Extension in settings.py.
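Enabling it looks like this in settings.py; the `myproject.extensions` module path is an assumption, so adjust it to wherever you defined `ItemCollectorExtension` in your project:

```python
# settings.py
# The dotted path is illustrative; point it at your own module.
EXTENSIONS = {
    "myproject.extensions.ItemCollectorExtension": 500,
}
```

The integer (500 here) is the usual Scrapy ordering value; any value works when you only have one extension.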

Upvotes: 0

Steven Almeroth

Reputation: 8202

This pipeline will make sure all Items have a rank.

class MyPipeline(object):

    def process_item(self, item, spider):
        # Fall back to a default rank when the spider didn't set one
        item['rank'] = item.get('rank') or '1'
        return item

Upvotes: 1

dm03514

Reputation: 55972

I think signals might help. I did something similar here:

https://github.com/dm03514/CraigslistGigs/blob/master/craigslist_gigs/pipelines.py

It seems kind of hacky, but in your spider you can create a property that will store all your scraped items. In your pipeline you can register a method to be called on the spider_closed signal. This method takes the spider instance as a parameter, so you can then access the spider property that contains all your scraped items.

Upvotes: 4
