Reputation: 1677
I have an item with a rank field that has to be built by analyzing other items of the same class. I don't want to use a database or other backend to store them - I just need to access all currently scraped items and do some itertools magic on them. How can I do this after the spider finishes but before the data is exported (so the rank field won't be empty)?
Upvotes: 2
Views: 2720
Reputation: 358
You can collect all scraped items using Extensions and Signals.
from scrapy import signals


class ItemCollectorExtension:
    # Collects every scraped item so they can be post-processed in one place

    def __init__(self):
        self.items = []

    @classmethod
    def from_crawler(cls, crawler):
        extension = cls()
        # Append each item as it is scraped, and run the final step on close
        crawler.signals.connect(extension.add_item, signal=signals.item_scraped)
        crawler.signals.connect(extension.spider_closed, signal=signals.spider_closed)
        return extension

    def spider_closed(self):
        print(self.items)  # Replace with your code

    def add_item(self, item):
        self.items.append(item)
Now, every time a new item is successfully scraped, it is added to self.items. When all items have been collected and the spider is closing, the spider_closed method is called; there you can access all the collected items.
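For the rank use case from the question, spider_closed is where the itertools-style post-processing would go. A minimal sketch of a replacement for the spider_closed method above, assuming a hypothetical score field and a hypothetical output path (note that Scrapy's built-in feed exports write each item as it is scraped, so ranks set this late won't appear in an already-written feed; this sketch writes the results out itself):

import json  # at the top of the extension module

    def spider_closed(self):
        # Rank items by an assumed 'score' field and export them manually
        ranked = sorted(self.items, key=lambda i: i.get('score', 0), reverse=True)
        for position, item in enumerate(ranked, start=1):
            item['rank'] = position
        with open('ranked_items.json', 'w') as f:
            json.dump([dict(item) for item in ranked], f)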
Don't forget to enable the extension in settings.py.
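A minimal way to do that (the dotted path is an assumption about where you place the class):

# settings.py
EXTENSIONS = {
    'myproject.extensions.ItemCollectorExtension': 500,
}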
Upvotes: 0
Reputation: 8202
This pipeline will make sure all Items have a rank.
class MyPipeline(object):
    def process_item(self, item, spider):
        # Fall back to a default rank if the spider did not set one
        item['rank'] = item.get('rank') or '1'
        return item
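Like any pipeline, it also has to be enabled in settings.py; a minimal sketch, assuming the class lives in myproject/pipelines.py:

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,
}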
Upvotes: 1
Reputation: 55972
I think signals might help. I did something similar here:
https://github.com/dm03514/CraigslistGigs/blob/master/craigslist_gigs/pipelines.py
It seems kind of hacky, but in your spider you can create an attribute that stores all your scraped items. In your pipeline you can register a method to be called on the spider_closed signal; that method takes the spider instance as a parameter, so you can access the spider attribute containing all the scraped items. A sketch of this approach follows.
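A minimal sketch of that idea (class names, field names, and the placeholder URL are illustrative assumptions, not taken from the linked pipeline):

from scrapy import Spider, signals


class RankedSpider(Spider):
    name = 'ranked'
    start_urls = ['http://example.com']  # placeholder

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.collected_items = []  # filled in by the pipeline below

    def parse(self, response):
        yield {'url': response.url, 'rank': None}


class CollectOnSpiderPipeline:
    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        # Run the post-processing step once the spider has finished
        crawler.signals.connect(pipeline.spider_closed, signal=signals.spider_closed)
        return pipeline

    def process_item(self, item, spider):
        spider.collected_items.append(item)
        return item

    def spider_closed(self, spider):
        # Every scraped item is reachable through the spider attribute here;
        # do the ranking (or other itertools work) before writing results out.
        for position, item in enumerate(spider.collected_items, start=1):
            item['rank'] = position

The pipeline still has to be listed in ITEM_PIPELINES for process_item to run.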
Upvotes: 4