R0byn

Reputation: 423

Python/Scrapy: Custom pipeline has no effect / download files with custom filename

This is a follow-up question to my initial question. I want to download PDFs and save them to the hard disk with a custom filename.

For the custom filename I tried this code in my pipelines.py according to this recommendation:

class PrangerPipeline(object):
    def process_item(self, item, spider):
        return item

    def file_path(self, request, response=None, info=None):
        original_path = super(PrangerPipeline, self).file_path(request, response=None, info=None)
        sha1_and_extension = original_path.split('/')[1] # delete 'full/' from the path
        return request.meta.get('filename','') + "_" + sha1_and_extension

    def get_media_requests(self, item, info):
        file_url = item['file_url']
        meta = {'filename': item['name']}
        yield Request(url=file_url, meta=meta)

In my settings.py I have:

ITEM_PIPELINES = {
    'pranger.pipelines.PrangerPipeline': 1,
    'scrapy.pipelines.files.FilesPipeline': 2,
}

But the files keep being saved only with their SHA1-hash, for example: a8569143c987cdd43dd1f6d9a6f98b7aa6fbc284.PDF. So my custom file_path function seems not to be used by Scrapy.

When I comment out the line

'scrapy.pipelines.files.FilesPipeline': 2,

nothing is downloaded at all.

I am confused...

Upvotes: 4

Views: 1036

Answers (1)

malberts

Reputation: 2536

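The problem is that your custom pipeline is a plain class, not a file pipeline, so Scrapy only ever calls its process_item (which returns the item unchanged); file_path and get_media_requests are never invoked. The stock FilesPipeline in your settings is what actually downloads the files, which is why they get the default SHA1 names. You need to subclass the original FilesPipeline and then use only PrangerPipeline in the settings.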

For example:

pipelines.py:

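from scrapy import Request
from scrapy.pipelines.files import FilesPipeline

class PrangerPipeline(FilesPipeline):

    # Don't override process_item. The parent class handles it.

    def file_path(self, request, response=None, info=None):
        # Same body as in the question, but forward response and info to the parent
        # instead of hard-coding None. The default path looks like 'full/<sha1>.<ext>'.
        original_path = super(PrangerPipeline, self).file_path(request, response=response, info=info)
        sha1_and_extension = original_path.split('/')[1] # delete 'full/' from the path
        return request.meta.get('filename', '') + "_" + sha1_and_extension

    def get_media_requests(self, item, info):
        # Pass the desired name along in the request meta so file_path can read it.
        yield Request(url=item['file_url'], meta={'filename': item['name']})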

settings.py:

ITEM_PIPELINES = {
    'pranger.pipelines.PrangerPipeline': 1,
}
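If you want to check what names the overridden file_path would produce without running a crawl, the naming logic can be sketched in plain Python. This is only an approximation: it assumes the default path has the form 'full/<SHA1 of the URL><extension>', which matches the SHA1 filenames in the question. The helper names default_file_path and custom_file_path and the example URL are hypothetical and not part of Scrapy.

```python
import hashlib
import os


def default_file_path(url):
    """Approximate the default layout: 'full/<sha1 of the URL><extension>'."""
    media_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    media_ext = os.path.splitext(url)[1]
    return "full/" + media_guid + media_ext


def custom_file_path(url, filename=""):
    """Mirror the override: drop 'full/' and prefix the name taken from request.meta."""
    sha1_and_extension = default_file_path(url).split("/")[1]
    return filename + "_" + sha1_and_extension


if __name__ == "__main__":
    # Hypothetical URL and item name, for illustration only.
    print(custom_file_path("http://example.com/docs/report.PDF", "report_2017"))
```

The result has no 'full/' prefix, so with this override the files should end up directly in the FILES_STORE directory rather than in a 'full' subfolder.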

Refer to my examples using ImagesPipeline here:

Unable to rename downloaded images through pipelines without the usage of item.py

Trouble renaming downloaded images in a customized manner through pipelines

Upvotes: 4
