Reputation: 49
I am trying to use Scrapy's ImagesPipeline to download images, but only one picture (the last one) gets downloaded.
My target website is https://sc.chinaz.com/tupian/
You can check my code:
# This is the spider:
import scrapy
from imgPro.items import ImgproItem
from PIL import Image


class ImgSpider(scrapy.Spider):
    name = 'img'
    #allowed_domains = ['www.xxx.com']
    start_urls = ['https://sc.chinaz.com/tupian/']

    def parse(self, response):
        div_list = response.xpath('//*[@id="container"]/div')
        # print(div_list)
        # url = 'https://sc.chinaz.com'
        for div in div_list:
            img_src = 'https:' + div.xpath('./div/a/img/@src2')[0].extract()
            print(img_src)
        item = ImgproItem()
        item['src'] = img_src
        yield item
Here is my pipeline:
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class imagePipeLine(ImagesPipeline):
    def get_media_requests(self, item, info):
        yield scrapy.Request(item['src'])

    def file_path(self, request, response=None, info=None, *, item=None):
        imag_name = request.url.split('/')[-1]
        return imag_name

    def item_completed(self, results, item, info):
        return item
What should I change to get all the images?
Upvotes: 0
Views: 99
Reputation: 856
In parse(), the for-loop goes through the list of all images, but once the loop has finished only the last URL is left in img_src, and you never come back to the previous ones. So you either need to process every image as soon as you get its respective img_src:
for div in div_list:
    img_src = 'https:' + div.xpath('./div/a/img/@src2')[0].extract()
    print(img_src)
    # now process this image
or to save all of them in a list and process the whole list later:
all_img_srcs = []
for div in div_list:
    img_src = 'https:' + div.xpath('./div/a/img/@src2')[0].extract()
    print(img_src)
    all_img_srcs.append(img_src)
# now process all the images on the list
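or maybe build one item per image, collect them in a list, and return that list from parse():
def parse(self, response):
    div_list = response.xpath('//*[@id="container"]/div')
    items = []
    for div in div_list:
        img_src = 'https:' + div.xpath('./div/a/img/@src2')[0].extract()
        print(img_src)
        item = ImgproItem()
        item['src'] = img_src
        items.append(item)
    # return (not yield) the list: Scrapy iterates over it item by item,
    # whereas "yield items" would emit the whole list as a single object
    return items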
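To see why the placement of the item creation matters, here is a minimal plain-Python sketch that does not need Scrapy at all. The function names `parse_after_loop` and `parse_in_loop`, the dict items, and the sample paths are hypothetical stand-ins for the spider's parse(), ImgproItem, and the @src2 values; this only illustrates the loop behaviour, it is not the spider itself.

```python
# Plain-Python stand-in for the spider's parse(); no Scrapy required.
# The function names and sample paths below are hypothetical.

def parse_after_loop(srcs):
    """Builds the item after the loop, so only the last URL survives."""
    for src in srcs:
        img_src = 'https:' + src
    # img_src now holds only the value from the final iteration
    yield {'src': img_src}


def parse_in_loop(srcs):
    """Builds and yields one item per URL inside the loop."""
    for src in srcs:
        img_src = 'https:' + src
        yield {'src': img_src}


sample = ['//example.com/a.jpg', '//example.com/b.jpg', '//example.com/c.jpg']
after_loop_items = list(parse_after_loop(sample))
in_loop_items = list(parse_in_loop(sample))
print(len(after_loop_items), len(in_loop_items))  # prints: 1 3
```

In the spider, the equivalent change is to move the item creation and `yield item` into the body of the for-loop, so that the pipeline receives one item per image.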
Upvotes: 1