SIM

Reputation: 22440

Trouble downloading images using scrapy

I've written a script in Python Scrapy to download some images from a website. When I run my script, I can see the links of the images (all in .jpg format) in the console. However, when the download is done and I open the folder where the images are supposed to be saved, there is nothing in there. Where am I making a mistake?

This is my spider (I'm running it from the Sublime Text editor):

import scrapy
from scrapy.crawler import CrawlerProcess

class YifyTorrentSpider(scrapy.Spider):
    name = "yifytorrent"

    start_urls= ['https://www.yify-torrent.org/search/1080p/']

    def parse(self, response):
        for q in response.css("article.img-item .poster-thumb"):
            image = response.urljoin(q.css("::attr(src)").extract_first())
            yield {'':image}

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',   
})
c.crawl(YifyTorrentSpider)
c.start()

This is what I've defined in settings.py for the images to be saved:

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = "/Desktop/torrentspider/torrentspider/spiders/Images"

To make things clearer:

  1. The folder in which I'm expecting the images to be saved is named Images, and I've placed it in the spiders folder under the torrentspider project.
  2. The actual address of the Images folder is C:\Users\WCS\Desktop\torrentspider\torrentspider\spiders.

This is not about running the script successfully with the help of an items.py file, so any solution that relies on items.py is not what I'm looking for.

Upvotes: 7

Views: 1170

Answers (2)

gusridd

Reputation: 874

The item you are yielding does not follow Scrapy's documentation. As detailed in the media pipeline documentation, the item should have a field called image_urls. You should change your parse method to something like this:

def parse(self, response):
    images = []
    for q in response.css("article.img-item .poster-thumb"):
        image = response.urljoin(q.css("::attr(src)").extract_first())
        images.append(image)
    yield {'image_urls': images} 

I just tested this and it works. Additionally, as commented by Pruthvi Kumar, IMAGES_STORE should just be:

IMAGES_STORE = 'Images'
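For context, response.urljoin resolves a relative src against the page URL the same way the standard library's urllib.parse.urljoin does, which is why the yielded URLs come out absolute. A quick sketch (the image paths below are hypothetical, for illustration only):

```python
from urllib.parse import urljoin

# Scrapy documents response.urljoin(url) as urljoin(response.url, url),
# so relative src attributes are resolved against the page being parsed.
base = 'https://www.yify-torrent.org/search/1080p/'

print(urljoin(base, '/movie/poster.jpg'))  # root-relative src
print(urljoin(base, 'poster.jpg'))         # page-relative src
```

The first resolves to https://www.yify-torrent.org/movie/poster.jpg, the second to https://www.yify-torrent.org/search/1080p/poster.jpg.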

Upvotes: 3

Pruthvi Kumar

Reputation: 898

The first thing that strikes me scanning the code above is the path in IMAGES_STORE. The leading / means you are starting from the absolute root of your machine, so either put the absolute path to where you want the images saved, or use a path relative to where you are running your crawler.

I'm on a Linux machine, so my absolute path would be something like IMAGES_STORE = '/home/pk/myProjects/scraper/images'

OR

IMAGES_STORE = 'images'
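A quick way to see the difference between the two styles, sketched with the standard library (the paths are illustrative):

```python
import os

# A leading slash makes the path absolute, i.e. anchored at the
# filesystem root rather than at your project directory:
print(os.path.isabs('/home/pk/myProjects/scraper/images'))  # True

# Without a leading slash, the path is resolved relative to the
# working directory the crawler was started from:
print(os.path.abspath('images'))  # <current working dir>/images
```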

Also, most importantly, if you are using the default pipeline, the field that holds the extracted image URLs (where you do extract_first()) must literally be named image_urls.

You are also missing a couple of steps. In your spider, add this:

class ImgData(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()

In the yield step, modify to:

yield ImgData(image_urls=[response.urljoin(q.css("::attr(src)").extract_first())])

Upvotes: 0
