Reputation: 423
This is a follow-up to my initial question. I want to download PDFs and save them to disk with custom filenames.
For the custom filenames, I tried this code in my pipelines.py, following this recommendation:
class PrangerPipeline(object):
    def process_item(self, item, spider):
        return item

    def file_path(self, request, response=None, info=None):
        original_path = super(PrangerPipeline, self).file_path(request, response=None, info=None)
        sha1_and_extension = original_path.split('/')[1]  # delete 'full/' from the path
        return request.meta.get('filename', '') + "_" + sha1_and_extension

    def get_media_requests(self, item, info):
        file_url = item['file_url']
        meta = {'filename': item['name']}
        yield Request(url=file_url, meta=meta)
In my settings.py I have:
ITEM_PIPELINES = {
    'pranger.pipelines.PrangerPipeline': 1,
    'scrapy.pipelines.files.FilesPipeline': 2,
}
But the files are still saved under their SHA1 hash only, for example a8569143c987cdd43dd1f6d9a6f98b7aa6fbc284.PDF, so Scrapy does not seem to use my custom file_path function.
When I comment out the line
    'scrapy.pipelines.files.FilesPipeline': 2,
nothing is downloaded at all.
I am confused...
Upvotes: 4
Views: 1036
Reputation: 2536
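The problem is that your custom pipeline is not a real files pipeline. It subclasses object, so Scrapy only ever calls its process_item method and never its file_path or get_media_requests methods. All downloading is done by the stock FilesPipeline, which uses the default SHA1 names, and that is also why nothing is downloaded once you disable it. You need to subclass FilesPipeline and then enable only PrangerPipeline in the settings.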
For example, in pipelines.py:
from scrapy.pipelines.files import FilesPipeline

class PrangerPipeline(FilesPipeline):
    # Don't override process_item. The parent class handles it.

    def file_path(self, request, response=None, info=None):
        # ...

    def get_media_requests(self, item, info):
        # ...
settings.py:
ITEM_PIPELINES = {
    'pranger.pipelines.PrangerPipeline': 1,
}
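The body of file_path is elided above. As a rough sketch of the naming logic it could delegate to, the helper below takes the default path that FilesPipeline produces (of the form full/<sha1>.<ext>, as in the question) and the custom name carried in request.meta. The function name custom_file_path and the fallback to the hash-only name when no custom name is given are assumptions made for illustration; they are not part of Scrapy or of the original code.

```python
import posixpath


def custom_file_path(default_path, filename=""):
    """Build a storage path from the default path and a custom name.

    default_path: the value returned by FilesPipeline.file_path,
                  e.g. 'full/<sha1>.pdf'
    filename:     the custom prefix carried in request.meta (may be empty)
    """
    # Drop the leading 'full/' directory and keep '<sha1>.<ext>'.
    sha1_and_extension = posixpath.basename(default_path)
    if not filename:
        # Assumption: fall back to the hash-only name when no prefix is given.
        return sha1_and_extension
    return filename + "_" + sha1_and_extension


# Hypothetical custom name "report" combined with the hash from the question.
print(custom_file_path("full/a8569143c987cdd43dd1f6d9a6f98b7aa6fbc284.PDF", "report"))
```

Inside PrangerPipeline.file_path, the first argument would come from the parent's file_path call and the second from request.meta.get('filename', ''). It is probably safer to forward response and info to the parent call rather than hard-coding None as the question's code does. Depending on the installed Scrapy version, file_path may also receive a keyword-only item argument, so the override's signature may need to accept it.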
For related examples that use ImagesPipeline, see my answers to these questions:
Unable to rename downloaded images through pipelines without the usage of item.py
Trouble renaming downloaded images in a customized manner through pipelines
Upvotes: 4