kynnemall

Reputation: 888

Scrapy 2.4.0 rename images in pipeline

I've only just started learning Scrapy from tutorials. My spider successfully downloads images from a website, but I haven't been able to rename the images using other SO answers. Most of those answers are over 4 years old and give me deprecation warnings when I run them, so I would like to know how to fix my pipeline to avoid such warnings.

Can someone please explain to me how I can fix my pipeline class to rename the images?

class ImagetestPipeline(ImagesPipeline):
    
    CONVERTED_ORIGINAL = re.compile('^full/[0-9,a-f]+.jpg$')

    # name information coming from the spider, in each item
    # add this information to Requests() for individual images downloads
    # through "meta" dictionary
    def get_media_requests(self, item, info):
        print("get_media_requests")
        yield [Request(x, meta={'image_name': item["image_names"]})
                for x in item.get('image_urls', [])]

    # this is where the image is extracted from the HTTP response
    def get_images(self, response, request, info):
        print("get_images")

        for key, image, buf, in super().get_images(response, request, info):
            if self.CONVERTED_ORIGINAL.match(key):
                key = self.change_filename(key, response)
            yield key, image, buf

    def change_filename(self, key, response):
        newname = response.meta['image_name'][0]
        return f"{newname}.jpg"
    
    def file_path(self, request, response=None, info=None):
        """This is the method used to determine file path"""
        path = super().file_path(request, response, info)
        return path.replace('full', '')

EDIT

class ImagetestPipeline(ImagesPipeline):
    
    def process_item(self, item, spider):
        self.product_name = spider.product_name
        return item
    
    def file_path(self, request, response=None, info=None):
        fileName = self.product_name
        fileExtension = fileName.split('.')[-1] # Get the file extension (e.g. .jpg, .png)
        return fileName + '.' + fileExtension

Upvotes: 0

Views: 382

Answers (2)

kynnemall

Reputation: 888

I found a solution after much searching. Using the code below in my custom pipeline, which inherits from Scrapy's ImagesPipeline, and defining image_name as a Field in my custom item, I can now rename the images as I want.

def get_media_requests(self, item, info):
    return [Request(x, meta={'image_name': item["image_name"]}) 
            for x in item.get('image_urls', [])]

def file_path(self, request, response=None, info=None):
    return f'{request.meta["image_name"]}.jpg'
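For the custom pipeline to run at all, it also has to be enabled in the project settings. A minimal sketch of the relevant settings.py entries follows; the package name imagetest and the storage directory images are assumptions for illustration, not taken from the original project.

```python
# settings.py (sketch): the module path and storage directory are assumptions.

# Register the custom pipeline. The dotted path assumes a project package
# named "imagetest" with the pipeline class defined in pipelines.py.
ITEM_PIPELINES = {
    "imagetest.pipelines.ImagetestPipeline": 1,
}

# Directory where downloaded images are written. Paths returned by
# file_path() are interpreted relative to this directory.
IMAGES_STORE = "images"
```

With these settings, an item whose image_name is "widget" would be stored as images/widget.jpg.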

Upvotes: 1

Anon Nymous

Reputation: 25

You will have to create a custom pipeline that extends Scrapy's default ImagesPipeline. I can see you already did this in your code by inheriting from the ImagesPipeline class.

The next step, according to the documentation, is to override the file_path method and return the file path you would like to use.

For example:

    def file_path(self, request, response=None, info=None):
        fileName = request.url.split('/')[-1] # Used to extract the file extension
        fileExtension = fileName.split('.')[-1] # Get the file extension (e.g. .jpg, .png)
        return 'newfilename.' + fileExtension
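Splitting the URL on '/' and '.' works for simple URLs, but it picks up the query string when one is present (for example photo.jpg?size=large). A slightly more defensive sketch using only the standard library is shown below; the helper names extension_from_url and renamed_path are hypothetical and not part of Scrapy.

```python
import posixpath
from urllib.parse import urlparse


def extension_from_url(url, default=".jpg"):
    """Return the lowercase file extension found in the URL path.

    Hypothetical helper: the query string and fragment are ignored, and
    `default` is returned when the path has no extension.
    """
    path = urlparse(url).path
    ext = posixpath.splitext(path)[1]
    return ext.lower() if ext else default


def renamed_path(name, url):
    """Build '<name><ext>', taking the extension from the image URL."""
    return f"{name}{extension_from_url(url)}"
```

Inside file_path, this could be called as renamed_path('newfilename', request.url) in place of the manual splitting.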

UPDATE: EDIT FILE NAME ACCORDING TO SCRAPED REQUEST

To achieve this, all I did was simply create a property within the Spider class, like so:

import scrapy

class YourSpider(scrapy.Spider):
    productName = ''

Then, during the scraping process, you change this name to whatever has been scraped from the response.

    def parse(self, response):
        # Get the product name
        productName = response.css('.product-title::text').get()
        # Update the class property accordingly
        self.productName = productName
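A scraped title can contain spaces, slashes, or other characters that are awkward in a file name, so it may be worth cleaning it before it is used as one. The following is a minimal sketch; safe_filename is a hypothetical helper, and the replacement rules are just one reasonable choice.

```python
import re


def safe_filename(title, fallback="image"):
    """Turn a scraped title into a conservative file name stem.

    Hypothetical helper: runs of characters other than letters, digits,
    underscore, dot, and hyphen are collapsed into a single underscore.
    """
    cleaned = re.sub(r"[^\w.-]+", "_", title.strip())
    # Drop leading/trailing dots and underscores left over from cleaning.
    cleaned = cleaned.strip("._")
    return cleaned or fallback
```

In parse, the last line could then become self.productName = safe_filename(productName), provided the selector actually returned a string.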

Now that the name is stored on the spider class, you can access it from within your pipeline, like so (apply this to the first solution I posted above, since that is the pipeline that renames images; the pipeline below is only an arbitrary one for demonstration):

class MyPipeline:
    def process_item(self, item, spider):
        # Do pipeline stuff
        # ..
        # Get the product name through the spider instance
        # (inside ImagesPipeline.file_path, use info.spider instead)
        productName = spider.productName
        return item

Upvotes: 1
