tony.crete

Reputation: 229

Scrapy file download using custom path based on item

What I would like to do is pretty basic, I think, but I couldn't find a way to implement it.

I am trying to use the FilesPipeline in Scrapy to download a file (e.g. Image1.jpg) and save it under a path derived from the item that placed the request in the first place (e.g. item.name).


It is pretty similar to this question here, except that I want to pass the item.name (or item.something) field as an argument, so that each file is saved under a custom path depending on item.name.

The path is defined in the persist_file function, but that function does not have access to the item itself, only to the file request and response.

# from Scrapy's FilesPipeline source
def get_media_requests(self, item, info):
    return [Request(x) for x in item.get(self.FILES_URLS_FIELD, [])]

I can also see above that this is where the requests that feed the files into the pipeline are made. Is there a way to pass an extra argument here, so I can use it later in file_downloaded and, after that, in persist_file?

As a last resort, it would be pretty simple to rename/move the file after it has been downloaded, in one of the following pipelines, but that seems sloppy, doesn't it?
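For reference, that fallback could look like the sketch below: a plain post-processing pipeline (the class name MoveFilesPipeline is my own, and it assumes the default files result field plus a name field on the item) that moves each downloaded file from the files store into a per-item subfolder.

```python
import os
import shutil


class MoveFilesPipeline:
    """Hypothetical post-processing pipeline: after FilesPipeline has run,
    move each downloaded file into a subfolder named after the item."""

    FILES_STORE = '/tmp/files'  # assumed; normally read from settings

    def process_item(self, item, spider):
        for entry in item.get('files', []):  # 'files' is the default FILES_RESULT_FIELD
            src = os.path.join(self.FILES_STORE, entry['path'])
            dst = os.path.join(self.FILES_STORE, item['name'],
                               os.path.basename(entry['path']))
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.move(src, dst)
            # keep the item's bookkeeping in sync with the new location
            entry['path'] = os.path.relpath(dst, self.FILES_STORE)
        return item
```

This pipeline would go after the files pipeline in ITEM_PIPELINES, so the 'files' field is already populated when it runs.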

I am using the code implemented here as a custom pipeline.

Can anyone help please? Thank you in advance :)

Upvotes: 2

Views: 1046

Answers (1)

eLRuLL

Reputation: 18799

Create your own pipeline (inheriting from FilesPipeline) and override its process_item method, so that the current item gets passed along to the other functions:

from twisted.internet.defer import DeferredList
from scrapy.utils.misc import arg_to_iter

def process_item(self, item, spider):
    info = self.spiderinfo
    requests = arg_to_iter(self.get_media_requests(item, info))
    # unlike the stock implementation, pass the item as an extra argument
    dlist = [self._process_request(r, info, item) for r in requests]
    dfd = DeferredList(dlist, consumeErrors=1)
    return dfd.addCallback(self.item_completed, item, info)

Then you need to override _process_request as well, and keep passing the item argument down the chain so it is available when the file path is built.

Upvotes: 1
