Geo

Reputation: 96907

Is it possible to have dynamically created pipelines in scrapy?

I have a pipeline that posts data to a webhook. I'd like to reuse it for another spider. My pipeline is like this:

import json

import requests
from scrapy.exceptions import DropItem


class Poster(object):
    def process_item(self, item, spider):
        # Fields forwarded to the webhook
        item_attrs = {
            "url": item['url'], "price": item['price'],
            "description": item['description'], "title": item['title']
        }

        data = json.dumps({"events": [item_attrs]})

        poster = requests.post(
            "http://localhost:3000/users/1/web_requests/69/supersecretstring",
            data=data, headers={'content-type': 'application/json'}
        )

        # Drop the item if the webhook did not accept it
        if poster.status_code != 200:
            raise DropItem("error posting event %s code=%s" % (item, poster.status_code))

        return item

The problem is that another spider would need to post to a different URL and possibly use different attributes. Is it possible to specify, instead of this:

class Spider(scrapy.Spider):
    name = "products"
    start_urls = (
        'some_url',
    )
    custom_settings = {
        'ITEM_PIPELINES': {
           'spider.pipelines.Poster': 300,
        },
    }

something like:

    custom_settings = {
        'ITEM_PIPELINES': {
           spider.pipelines.Poster(some_other_url, some_attributes): 300,
        },
    }

I know the URL I would need when I'm creating the spider, as well as the fields I would be extracting.

Upvotes: 2

Views: 441

Answers (1)

Granitosaurus

Reputation: 21436

There are a few ways of doing this, but the simplest one would be to use open_spider(self, spider) in your pipeline.

Example use case:

scrapy crawl myspider -a pipeline_count=123

Then set up your pipeline to read this:

class MyPipeline(object):
    count = None

    def open_spider(self, spider):
        # Arguments passed with -a become spider attributes and arrive as strings
        count = getattr(spider, 'pipeline_count')
        self.count = int(count)

    # or as starrify pointed out in the comment below
    # access it directly in process_item
    def process_item(self, item, spider):
        count = getattr(spider, 'pipeline_count')  # still a string here
        item['count'] = count
        return item
    <...>
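
Applied to the webhook pipeline from the question, the same idea could look roughly like the sketch below. The attribute names webhook_url and webhook_fields, the class name ConfigurablePoster and the example URL are made up for illustration. The POST uses the standard library's urllib instead of requests so the sketch has no third-party dependencies, and ProductsSpider is a plain stand-in for a real scrapy.Spider subclass.

```python
import json
import urllib.request


class ConfigurablePoster:
    """Pipeline sketch that takes its webhook URL and field list from the spider."""

    def open_spider(self, spider):
        # webhook_url and webhook_fields are hypothetical attribute names,
        # assumed to be defined on the spider class.
        self.url = spider.webhook_url
        self.fields = list(spider.webhook_fields)

    def build_payload(self, item):
        # Keep only the fields this spider asked for, in the question's format.
        attrs = {name: item[name] for name in self.fields}
        return json.dumps({"events": [attrs]})

    def process_item(self, item, spider):
        request = urllib.request.Request(
            self.url,
            data=self.build_payload(item).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        # urlopen raises urllib.error.HTTPError for 4xx/5xx responses;
        # a real pipeline could catch it and raise DropItem as in the question.
        with urllib.request.urlopen(request):
            pass
        return item


class ProductsSpider:
    """Stand-in for a scrapy.Spider subclass, so the sketch runs without Scrapy."""
    name = "products"
    webhook_url = "http://localhost:3000/hypothetical/webhook"  # assumed URL
    webhook_fields = ("url", "price")  # assumed field names
```

Each spider then only declares its own webhook_url and webhook_fields, and all of them can point ITEM_PIPELINES at the same pipeline class in custom_settings, the same way the question enables 'spider.pipelines.Poster'.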

Upvotes: 3
