Reputation: 888
I've only just started learning Scrapy from tutorials. I have a spider that successfully downloads images from a website, but I have been unable to rename the images using other SO answers. Most of those answers are over 4 years old and produce deprecation warnings when I run them, so I would like to know how to fix my pipeline to avoid such warnings.
Can someone please explain to me how I can fix my pipeline class to rename the images?
class ImagetestPipeline(ImagesPipeline):
    CONVERTED_ORIGINAL = re.compile('^full/[0-9,a-f]+.jpg$')

    # name information coming from the spider, in each item
    # add this information to Requests() for individual images downloads
    # through "meta" dictionary
    def get_media_requests(self, item, info):
        print("get_media_requests")
        yield [Request(x, meta={'image_name': item["image_names"]})
               for x in item.get('image_urls', [])]

    # this is where the image is extracted from the HTTP response
    def get_images(self, response, request, info):
        print("get_images")
        for key, image, buf in super().get_images(response, request, info):
            if self.CONVERTED_ORIGINAL.match(key):
                key = self.change_filename(key, response)
            yield key, image, buf

    def change_filename(self, key, response):
        newname = response.meta['image_name'][0]
        return f"{newname}.jpg"

    def file_path(self, request, response=None, info=None):
        """This is the method used to determine file path"""
        path = super().file_path(request, response, info)
        return path.replace('full', '')
EDIT
class ImagetestPipeline(ImagesPipeline):
    def process_item(self, item, spider):
        self.product_name = spider.product_name
        return item

    def file_path(self, request, response=None, info=None):
        fileName = self.product_name
        fileExtension = fileName.split('.')[-1]  # Get the file extension (e.g. .jpg, .png)
        return fileName + '.' + fileExtension
Upvotes: 0
Views: 382
Reputation: 888
I found a solution after much searching. Using the code below in my custom pipeline, which inherits from Scrapy's ImagesPipeline, and defining image_name as a Field in my custom item, I can now rename the images as I want.
def get_media_requests(self, item, info):
    return [Request(x, meta={'image_name': item["image_name"]})
            for x in item.get('image_urls', [])]

def file_path(self, request, response=None, info=None):
    return f'{request.meta["image_name"]}.jpg'
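Since image_name comes straight from scraped text, it may contain spaces or slashes that end up in the stored path. A minimal sketch of a clean-up step that file_path could call before returning; the helper name safe_image_filename and the fallback name 'image' are my own assumptions, not part of Scrapy:

```python
import re


def safe_image_filename(image_name, ext='.jpg'):
    """Hypothetical helper: turn scraped text into a flat, filesystem-friendly name."""
    # Collapse every run of characters that is not a letter, digit,
    # underscore or hyphen into a single underscore.
    cleaned = re.sub(r'[^\w\-]+', '_', image_name.strip()).strip('_')
    # Fall back to a placeholder if nothing usable is left.
    return f"{cleaned or 'image'}{ext}"
```

In the pipeline above, file_path would then return safe_image_filename(request.meta["image_name"]) instead of building the string directly.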
Upvotes: 1
Reputation: 25
You will have to create a custom pipeline that extends Scrapy's default ImagesPipeline. I can see you already did this in your code by inheriting from the ImagesPipeline class.
The next step, according to the documentation, is to override the file_path method and return the file path you would like to use.
For example:
def file_path(self, request, response=None, info=None):
    fileName = request.url.split('/')[-1]  # Used to extract the file extension
    fileExtension = fileName.split('.')[-1]  # Get the file extension (e.g. .jpg, .png)
    return 'newfilename.' + fileExtension
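Splitting the URL on '.' can misfire when the URL carries a query string or has no extension at all. A rough sketch of a more defensive way to derive the extension using only the standard library; the function name extension_from_url and the '.jpg' default are assumptions for illustration:

```python
import os
from urllib.parse import urlparse


def extension_from_url(url, default_ext='.jpg'):
    """Hypothetical helper: return the file extension of a URL, including the dot."""
    # urlparse(...).path drops any ?query or #fragment part of the URL.
    url_path = urlparse(url).path
    # splitext returns ('', '') style pairs when there is no extension.
    ext = os.path.splitext(url_path)[1].lower()
    return ext or default_ext
```

file_path could then return 'newfilename' + extension_from_url(request.url). Note that the Scrapy documentation describes ImagesPipeline as converting downloaded images to JPEG, so it is worth checking that the extension you keep still matches the stored data.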
UPDATE: EDIT FILE NAME ACCORDING TO SCRAPED REQUEST
To achieve this, all I did was simply create a property within the Spider class, like so:
import scrapy

class YourSpider(scrapy.Spider):
    productName = ''
Then, during the scraping process you change this name to whatever has been scraped from the request.
def parse(self, response):
    # Get the product name
    productName = response.css('.product-title::text').get()
    # Update the class property accordingly
    self.productName = productName
Now that you've added it to the Spider class, you can access it within your pipeline, like so (apply this to the first solution I posted above, as that is the pipeline for renaming images; the one below is only an arbitrary pipeline to demonstrate):
class MyPipeline:
    def process_item(self, item, spider):
        # Do pipeline stuff
        # ..
        # Get the product name through the spider instance
        productName = spider.productName
        return item
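One caveat worth checking with this approach: Scrapy schedules downloads asynchronously, so a single spider-level attribute may already have been overwritten by a later parse call by the time a pipeline reads it. The plain-Python simulation below (no Scrapy involved; FakeSpider and both function names are made up for illustration, and the ordering shown is a deliberately pessimistic assumption) contrasts that with carrying the name on each request's meta, as the other answer does:

```python
class FakeSpider:
    """Stand-in for a spider; not a Scrapy class."""
    productName = ''


def names_from_spider_attribute(product_names):
    spider = FakeSpider()
    queued = []
    for name in product_names:
        spider.productName = name  # each parse() call overwrites the attribute
        queued.append(object())    # an image download is queued, not yet saved
    # Downloads finish later and all read whatever the attribute holds by then.
    return [f'{spider.productName}.jpg' for _ in queued]


def names_from_request_meta(product_names):
    # Each queued download carries its own name, so nothing is shared.
    queued = [{'image_name': name} for name in product_names]
    return [f"{meta['image_name']}.jpg" for meta in queued]
```

If several products are scraped in one run, keeping the name on the item or request is the safer of the two designs.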
Upvotes: 1