John

Reputation: 1

Scrapy execute Pipelines in specific order

I have a couple of pipelines that are set to be executed one after the other, like this:

SETTINGS = {
  ...,
  "ITEM_PIPELINES": {
    "pipelines.my_spider_pipeline.MySpiderPipeline": 1,
    "pipelines.my_images_pipeline.MyImagesPipeline": 2,
  },
}

This doesn't seem to work as expected, and I'm not sure whether it's because of the code in pipelines.my_spider_pipeline.MySpiderPipeline:

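from scrapy import signals


class MySpiderPipeline(object):
    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        # Build the pipeline with the crawler's stats collector
        pipeline = cls(crawler.stats)
        # item_scraped is presumably a handler defined elsewhere in this class (not shown)
        crawler.signals.connect(pipeline.item_scraped, signal=signals.item_scraped)
        return pipeline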

The stats argument is for passing in the crawler's StatsCollector instance.

Now, whenever my code is executed, it goes to from_crawler first but then jumps to a function defined in MyImagesPipeline. I need it to go to process_item in MySpiderPipeline instead, because that is where I insert data into the database, and I need the id of the database record to be available by the time MyImagesPipeline runs.

What can be done about that? I think this code isn't flexible at all, and any change would mean moving a lot of code. I'm open to any suggestion.

I tried not using from_crawler, but that didn't change anything.

Upvotes: -2

Views: 52

Answers (1)

msenior_

Reputation: 2120

First, you need to define id as a field in your item definition. Then, in the process_item method of the MySpiderPipeline class, obtain the id of the record inserted in the database and save it on the item.

class MySpiderPipeline:
    def process_item(self, item, spider):
        # insert item in db and get back the id inserted
        # code here
        inserted_id = None  # placeholder: replace with the id returned by your insert

        # add the id returned to the item and return it
        item["id"] = inserted_id
        return item

In the process_item method of the MyImagesPipeline class you need to retrieve the value of the id that you set in the MySpiderPipeline class and use it as applicable.

class MyImagesPipeline:
    def process_item(self, item, spider):
        # retrieve the id value that is part of the item
        record_id = item["id"]

        # use the id value as needed
        # code here
        return item
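
To see the hand-off end to end without a Scrapy project, here is a minimal, self-contained sketch of the same idea. Several parts are assumptions for illustration only: an in-memory SQLite table stands in for the real database, the title and image_dir keys are made up, and run_pipelines is a hypothetical stand-in for the way Scrapy chains process_item calls in ascending ITEM_PIPELINES order. It is not Scrapy's own code.

```python
import sqlite3


class MySpiderPipeline:
    """Stores each item and records the new row id on the item."""

    def __init__(self):
        # Assumption: an in-memory SQLite table stands in for the real database.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT)"
        )

    def process_item(self, item, spider):
        cursor = self.conn.execute(
            "INSERT INTO items (title) VALUES (?)", (item["title"],)
        )
        self.conn.commit()
        # lastrowid is the primary key of the row that was just inserted.
        item["id"] = cursor.lastrowid
        return item


class MyImagesPipeline:
    """Reads the id that the earlier pipeline stored on the item."""

    def process_item(self, item, spider):
        record_id = item["id"]
        # Hypothetical use of the id: derive a storage folder for the images.
        item["image_dir"] = f"images/{record_id}"
        return item


def run_pipelines(item, pipelines, spider=None):
    """Hypothetical stand-in for pipeline chaining: lower order value runs first."""
    for pipeline, _order in sorted(pipelines, key=lambda pair: pair[1]):
        item = pipeline.process_item(item, spider)
    return item


# Listed out of order on purpose; the order values decide who runs first.
pipelines = [(MyImagesPipeline(), 2), (MySpiderPipeline(), 1)]
first = run_pipelines({"title": "first"}, pipelines)
second = run_pipelines({"title": "second"}, pipelines)
print(first["id"], first["image_dir"])
```

The key point is that each process_item returns the item, so whatever MySpiderPipeline writes to it is visible to MyImagesPipeline, which has the higher order value and therefore runs later.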

Upvotes: 0
