PeterWu

Scrapy only downloads one image

I am trying to use the ImagesPipeline to download images, but I only get one picture (the last one); see the screenshot:

screenshot

My target website is https://sc.chinaz.com/tupian/

Here is my code.

This is the spider:

import scrapy
from imgPro.items import ImgproItem
from PIL import Image


class ImgSpider(scrapy.Spider):
    name = 'img'
    #allowed_domains = ['www.xxx.com']
    start_urls = ['https://sc.chinaz.com/tupian/']


    def parse(self, response):
        div_list = response.xpath('//*[@id="container"]/div')
        # print(div_list)
        # url = 'https://sc.chinaz.com'

        for div in div_list:
            
            img_src = 'https:' + div.xpath('./div/a/img/@src2')[0].extract()
            print(img_src)

        item = ImgproItem()
        item['src'] = img_src

        
        yield item

Here is my pipeline:

import scrapy

from scrapy.pipelines.images import ImagesPipeline

class imagePipeLine(ImagesPipeline):

    def get_media_requests(self, item, info):
        yield scrapy.Request(item['src'])

    def file_path(self, request, response=None, info=None, *, item=None):
        imag_name = request.url.split('/')[-1]
        return imag_name

    def item_completed(self, results, item, info):
        return item

What should I change to get all the images?

Answers (1)

In parse(), the for loop iterates over every image, but the item is created and yielded only once, after the loop has finished. By that point img_src holds only the last URL, and the earlier ones are never used. So you need to either process every image as soon as you get its img_src:

for div in div_list:        
    img_src = 'https:' + div.xpath('./div/a/img/@src2')[0].extract()
    print(img_src)
    # now process this image

or to save all of them in a list and process the whole list later:

all_img_srcs = []
for div in div_list:
    img_src = 'https:' + div.xpath('./div/a/img/@src2')[0].extract()
    print(img_src)
    all_img_srcs.append(img_src)

# now process all the images on the list

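or, written out in parse() itself, create and yield one item per image inside the loop (yielding a whole list at once would not work, because Scrapy expects each yielded object to be a single item or request):

def parse(self, response):
    div_list = response.xpath('//*[@id="container"]/div')

    for div in div_list:
        img_src = 'https:' + div.xpath('./div/a/img/@src2')[0].extract()
        print(img_src)
        # build a fresh item for every image and hand it to the pipeline
        item = ImgproItem()
        item['src'] = img_src
        yield item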
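
The effect of where the yield sits can be reproduced without Scrapy. The following is only a minimal sketch with plain Python generators: the function names and the example.com URLs are made up for illustration and are not part of the question's project.

```python
# Hypothetical stand-ins: these URLs and function names are invented
# for illustration and do not come from the target site.
SRCS = [
    "https://example.com/a.jpg",
    "https://example.com/b.jpg",
    "https://example.com/c.jpg",
]


def parse_yield_after_loop(srcs):
    """Mirrors the question: the yield sits outside the loop."""
    for src in srcs:
        img_src = src  # overwritten on every iteration
    # Only the value from the final iteration is still bound here.
    yield {"src": img_src}


def parse_yield_in_loop(srcs):
    """Yields one item per image, as soon as its URL is known."""
    for src in srcs:
        yield {"src": src}


def parse_collect_then_yield(srcs):
    """Collects all URLs first, then emits the items one by one."""
    all_img_srcs = []
    for src in srcs:
        all_img_srcs.append(src)
    for src in all_img_srcs:
        yield {"src": src}


after_loop = list(parse_yield_after_loop(SRCS))
in_loop = list(parse_yield_in_loop(SRCS))
collected = list(parse_collect_then_yield(SRCS))

print(len(after_loop), len(in_loop), len(collected))
```

The first variant produces a single item carrying the last URL, which matches the single downloaded image in the question. The other two variants produce one item per URL, so the pipeline would receive a request for every image.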
