mikebridge

Reputation: 4575

Transform final output in scrapy?

I have a scrapy process which successfully parses items and sub-items, but I can't see whether there's a final hook which would allow me to transform the final data result after everything has been parsed, but before it is formatted as output.

My spider is doing something like this:

class MySpider(scrapy.Spider):

    def parse(self, response, **kwargs):
        for part in [1,2,3]:
            url = f'{response.request.url}?part={part}'
            yield scrapy.Request(url=url, callback=self.parse_part, meta={'part': part})

    def parse_part(self, response, **kwargs):
        # ... 
        for subpart in part:
            yield {
                'title': self.get_title(subpart),
                'tag': self.get_tag(subpart)
            }

This works well, but I haven't been able to figure out where I can take the complete resulting structure and transform it before it is output as JSON (or whatever format). I thought maybe I could do this in the process_spider_output method of a spider middleware, but that only seems to give me the individual items, not the final structure.

Upvotes: 0

Views: 168

Answers (1)

Felix Eklöf

Reputation: 3720

You can define this method on your spider to do something after the spider has closed (Scrapy calls it as a shortcut for the spider_closed signal):

def closed(self, reason):

However, you can't modify items in that method. To modify items you need to write a custom item pipeline. A pipeline has a process_item method that is called every time your spider yields an item, so you can collect all items in a list there and then transform the whole list in the pipeline's close_spider method.

See the Item Pipeline page of the Scrapy documentation for how to write your own pipeline.

Example

Let's say you want all your items as JSON, for example to send them to an API in a single request. You have to activate your pipeline in settings.py for it to be used.
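For reference, activation might look roughly like this. The module path myproject.pipelines is an assumption (use whatever your project is actually called), and 300 is just a conventional priority value; lower numbers run earlier.

```python
# settings.py
# "myproject.pipelines.MyPipeline" is a placeholder path, not from the question.
ITEM_PIPELINES = {
    "myproject.pipelines.MyPipeline": 300,
}
```

The pipeline class itself: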

import json

class MyPipeline:

    def __init__(self, *args, **kwargs):
        self.items = []

    def process_item(self, item, spider):
        self.items.append(item)
        return item

    def close_spider(self, spider):
        # In this method you can iterate over self.items and transform them as you like.
        json_data = json.dumps(self.items)
        print(json_data)
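One caveat, as far as I understand it: the built-in feed export receives each item individually after it has passed through the pipelines, so a transformed aggregate has to be written out by the pipeline itself. Here is a minimal sketch of that idea. The grouping by 'tag', the class name GroupByTagPipeline and the file name grouped.json are all assumptions for illustration, not something from the question.

```python
import json
from collections import defaultdict


class GroupByTagPipeline:
    """Hypothetical pipeline: collects items, then writes one grouped JSON file."""

    def __init__(self, path="grouped.json"):
        self.path = path  # assumed output location
        self.items = []

    def process_item(self, item, spider):
        # Keep a plain-dict copy so the item can be serialized later.
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        # Transform the complete result: map each tag to the list of its titles.
        grouped = defaultdict(list)
        for item in self.items:
            grouped[item["tag"]].append(item["title"])
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(grouped, f, indent=2)
```

With this approach you would probably not pass -o on the command line at all, since the pipeline produces the final file on its own.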

Upvotes: 1
