
Reputation: 1457

How to download scrapy images in a dynamic folder based on

I'm trying to override the default path full/hash.jpg with <dynamic>/hash.jpg. I tried the approach from "How to download scrapy images in a dynamic folder", using the following code:

def item_completed(self, results, item, info):

    for result in [x for ok, x in results if ok]:
        path = result['path']
        # here we create the session-path where the files should be in the end
        # you'll have to change this path creation depending on your needs
        slug = slugify(item['category'])
        target_path = os.path.join(slug, os.path.basename(path))

        # try to move the file and raise exception if not possible
        if not os.rename(path, target_path):
            raise DropItem("Could not move image to target folder")

    if self.IMAGES_RESULT_FIELD in item.fields:
        item[self.IMAGES_RESULT_FIELD] = [x for ok, x in results if ok]
    return item

but I get:

Traceback (most recent call last):
    File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
    File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 839, in _cbDeferred
    File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 382, in callback
    File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 490, in _startRunCallbacks
    --- <exception caught here> ---
    File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
    File "/home/user/Projects/sepid/scraper/scraper/pipelines.py", line 44, in item_completed
    if not os.rename(path, target_path):
    exceptions.OSError: [Errno 2] No such file or directory

I don't know what's wrong. Is there any other way to change the path? Thanks

Upvotes: 4

Views: 5595

Answers (4)


Reputation: 1

To dynamically set the path for images downloaded by a scrapy spider prior to downloading images rather than moving them afterward, I created a custom pipeline overriding the get_media_requests and file_path methods.

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        return [Request(url, meta={'f1':item.get('field1'), 'f2':item.get('field2'), 'f3':item.get('field3'), 'f4':item.get('field4')}) for url in item.get(self.images_urls_field, [])]

    def file_path(self, request, response=None, info=None):
        ## start of deprecation warning block (can be removed in the future)
        def _warn():
            from scrapy.exceptions import ScrapyDeprecationWarning
            import warnings
            warnings.warn('ImagesPipeline.image_key(url) and file_key(url) methods are deprecated, '
                      'please use file_path(request, response=None, info=None) instead',
                      category=ScrapyDeprecationWarning, stacklevel=1)

        # check if called from image_key or file_key with url as first argument
        if not isinstance(request, Request):
            _warn()
            url = request
        else:
            url = request.url

        # detect if file_key() or image_key() methods have been overridden
        if not hasattr(self.file_key, '_base'):
            _warn()
            return self.file_key(url)
        elif not hasattr(self.image_key, '_base'):
            _warn()
            return self.image_key(url)
        ## end of deprecation warning block

        image_guid = hashlib.sha1(to_bytes(url)).hexdigest()
        return '%s/%s/%s/%s/%s.jpg' % (request.meta['f1'], request.meta['f2'], request.meta['f3'], request.meta['f4'], image_guid)

This approach assumes you define a scrapy.Item in your spider and replace, e.g., "field1" with your particular field name. Setting Request.meta in get_media_requests allows item field values to be used in setting download directories for each item, as shown in the return statement for file_path. Scrapy will create the directories automatically if they don't exist.
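
Because item.get() returns None for a field that was never set, the format string in file_path can end up producing a literal "None" directory, and field values may contain characters that are awkward in paths. A minimal sketch of a helper that cleans the meta values first; the name meta_to_dir, the "unknown" fallback, and the replacement character are assumptions for illustration and not part of Scrapy:

```python
import re


def meta_to_dir(meta, keys, fallback="unknown"):
    """Build a relative directory such as 'a/b/c' from request.meta values.

    Hypothetical helper: missing or empty values become `fallback`, and
    characters that are awkward in file names are replaced with '_'.
    """
    parts = []
    for key in keys:
        value = meta.get(key)
        text = str(value).strip() if value is not None else ""
        # Keep word characters, dot and dash; replace everything else.
        text = re.sub(r"[^\w.-]+", "_", text).strip("._")
        parts.append(text or fallback)
    return "/".join(parts)


example = meta_to_dir({"f1": "Shoes / Boots", "f2": None}, ["f1", "f2", "f3"])
```

Inside file_path this could be used as '%s/%s.jpg' % (meta_to_dir(request.meta, ['f1', 'f2', 'f3', 'f4']), image_guid).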

Custom pipeline class definitions are saved in my project's pipelines.py. Methods here are adapted directly from the default scrapy pipeline images.py, which on my Mac is stored in ~/anaconda3/pkgs/scrapy-1.5.0-py36_0/lib/python3.6/site-packages/scrapy/pipelines/. Imports and additional methods can be copied from that file as needed.
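
For the custom pipeline to run, it also has to be enabled in the project's settings.py. A sketch of the relevant fragment, assuming the project package is called myproject and images go under a local images directory (both are placeholders, not taken from the answer):

```python
# settings.py (fragment); "myproject" and the store path are placeholders
ITEM_PIPELINES = {
    "myproject.pipelines.MyImagesPipeline": 1,
}
# Root directory under which the paths returned by file_path are created
IMAGES_STORE = "images"
```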

Upvotes: 0


Reputation: 1482

I created a pipeline that inherits from ImagesPipeline, overrode its file_path method, and used it in place of the standard ImagesPipeline:

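import hashlib
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.python import to_bytes
class StoreImgPipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None):
        # YEAR is not defined in this snippet; it is assumed to be set elsewhere in the module
        image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
        return 'realty-sc/%s/%s/%s/%s.jpg' % (YEAR, image_guid[:2], image_guid[2:4], image_guid)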
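
The path layout used here can be tried outside Scrapy. A sketch, assuming a hypothetical helper sharded_image_path and an example URL; to_bytes is replaced by a plain UTF-8 encode, and 2018 stands in for YEAR. Using the first characters of the hash as subdirectories keeps any single directory from growing very large.

```python
import hashlib


def sharded_image_path(url, prefix, year):
    """Return '<prefix>/<year>/<h[:2]>/<h[2:4]>/<h>.jpg' for the SHA-1 of url."""
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "%s/%s/%s/%s/%s.jpg" % (
        prefix, year, image_guid[:2], image_guid[2:4], image_guid)


# Example URL and year are illustrative only.
path = sharded_image_path("http://example.com/a.jpg", "realty-sc", 2018)
parts = path.split("/")
```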

Upvotes: 7


Reputation: 318

The solution that @neelix gave is the best one, but when I tried it I got strange results: some files were moved, but not all of them. So I replaced:

if not os.rename(path, target_path):
    raise DropItem("Could not move image to target folder")

with shutil.move (after importing the shutil library), and my code is now:

def item_completed(self, results, item, info):

    for result in [x for ok, x in results if ok]:
        path = result['path']
        slug = slugify(item['designer'])

        settings = get_project_settings()
        storage = settings.get('IMAGES_STORE')

        target_path = os.path.join(storage, slug, os.path.basename(path))
        path = os.path.join(storage, path)

        # If path doesn't exist, it will be created
        if not os.path.exists(os.path.join(storage, slug)):
            os.makedirs(os.path.join(storage, slug))

        shutil.move(path, target_path)

    if self.IMAGES_RESULT_FIELD in item.fields:
        item[self.IMAGES_RESULT_FIELD] = [x for ok, x in results if ok]
    return item
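
The move step can be isolated into a small function and checked without running a spider. A sketch, assuming a hypothetical helper move_into_subfolder and Python 3 (os.makedirs with exist_ok is not available on Python 2.7); the temporary directory only mimics the IMAGES_STORE/full/<hash>.jpg layout:

```python
import os
import shutil
import tempfile


def move_into_subfolder(storage, rel_path, subfolder):
    """Move storage/rel_path to storage/subfolder/<basename>.

    Returns the new path relative to storage.
    """
    target_dir = os.path.join(storage, subfolder)
    # Create the destination directory if needed (no error if it exists).
    os.makedirs(target_dir, exist_ok=True)
    new_rel_path = os.path.join(subfolder, os.path.basename(rel_path))
    shutil.move(os.path.join(storage, rel_path),
                os.path.join(storage, new_rel_path))
    return new_rel_path


# Demo in a throwaway directory standing in for IMAGES_STORE
storage = tempfile.mkdtemp()
os.makedirs(os.path.join(storage, "full"))
with open(os.path.join(storage, "full", "abc.jpg"), "wb") as f:
    f.write(b"fake image bytes")

new_path = move_into_subfolder(
    storage, os.path.join("full", "abc.jpg"), "some-designer")
```

Returning the new relative path makes it possible to update result['path'] as well, so the item does not keep pointing at the old location.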

I hope it works for you too :)

Upvotes: -1


Reputation: 1457

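The problem arises because the destination folder doesn't exist (note also that result['path'] is relative to IMAGES_STORE, so both paths are joined with it below). A quick solution is: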

def item_completed(self, results, item, info):

    for result in [x for ok, x in results if ok]:
        path = result['path']
        slug = slugify(item['designer'])

        settings = get_project_settings()
        storage = settings.get('IMAGES_STORE')

        target_path = os.path.join(storage, slug, os.path.basename(path))
        path = os.path.join(storage, path)

        # If path doesn't exist, it will be created
        if not os.path.exists(os.path.join(storage, slug)):
            os.makedirs(os.path.join(storage, slug))

        # os.rename returns None on success, so detect failure via OSError
        try:
            os.rename(path, target_path)
        except OSError:
            raise DropItem("Could not move image to target folder")

    if self.IMAGES_RESULT_FIELD in item.fields:
        item[self.IMAGES_RESULT_FIELD] = [x for ok, x in results if ok]
    return item
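
Two behaviours of os.rename matter in this thread and can be reproduced in a throwaway directory: renaming into a folder that does not exist raises OSError (the error in the question's traceback), and a successful rename returns None, so the check `if not os.rename(...)` used in the question is true even when the move worked. A small sketch; the file and folder names are made up:

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "a.jpg")
with open(src, "wb") as f:
    f.write(b"x")

# Renaming into a directory that does not exist raises OSError.
try:
    os.rename(src, os.path.join(workdir, "missing", "a.jpg"))
    missing_dir_error = None
except OSError as exc:
    missing_dir_error = exc

# A successful rename returns None, which is falsy.
os.makedirs(os.path.join(workdir, "present"))
result = os.rename(src, os.path.join(workdir, "present", "a.jpg"))
```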

Upvotes: 3
