Reputation: 22440
I've written a script in Python Scrapy to download some images from a website. When I run my script, I can see the links of the images (all of them in .jpg format) in the console. However, when I open the folder in which the images are supposed to be saved once the downloading is done, there is nothing in there. Where am I going wrong?
This is my spider (I'm running it from the Sublime Text editor):
import scrapy
from scrapy.crawler import CrawlerProcess

class YifyTorrentSpider(scrapy.Spider):
    name = "yifytorrent"

    start_urls = ['https://www.yify-torrent.org/search/1080p/']

    def parse(self, response):
        for q in response.css("article.img-item .poster-thumb"):
            image = response.urljoin(q.css("::attr(src)").extract_first())
            yield {'': image}

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
})
c.crawl(YifyTorrentSpider)
c.start()
This is what I've defined in settings.py for the images to be saved:
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = "/Desktop/torrentspider/torrentspider/spiders/Images"
To make things clearer:

- The Images folder is one I've placed in the spiders folder under the torrentspider project, i.e. inside C:\Users\WCS\Desktop\torrentspider\torrentspider\spiders.
- It's not about running the script successfully with the help of an items.py file, so any solution that makes the download happen by using an items.py file is not what I'm looking for.
Upvotes: 7
Views: 1170
Reputation: 874
The item you are yielding does not follow Scrapy's documentation. As detailed in the media pipeline documentation, the item should have a field called image_urls. You should change your parse method to something similar to this:
def parse(self, response):
    images = []
    for q in response.css("article.img-item .poster-thumb"):
        image = response.urljoin(q.css("::attr(src)").extract_first())
        images.append(image)
    yield {'image_urls': images}
I just tested this and it works. Additionally, as commented by Pruthvi Kumar, IMAGES_STORE should just be something like
IMAGES_STORE = 'Images'
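For completeness, here is a minimal standalone sketch of how the corrected parse method and the pipeline settings could fit together when the script is launched directly from an editor. Passing ITEM_PIPELINES and IMAGES_STORE to CrawlerProcess instead of relying on settings.py is an assumption about how the script is being run, and the ImagesPipeline needs the Pillow library installed:

import scrapy
from scrapy.crawler import CrawlerProcess

class YifyTorrentSpider(scrapy.Spider):
    name = "yifytorrent"

    start_urls = ['https://www.yify-torrent.org/search/1080p/']

    def parse(self, response):
        # Collect the poster URLs under the field name the ImagesPipeline expects.
        images = []
        for q in response.css("article.img-item .poster-thumb"):
            images.append(response.urljoin(q.css("::attr(src)").extract_first()))
        yield {'image_urls': images}

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # Settings passed here apply even when the script is run outside "scrapy crawl".
    'ITEM_PIPELINES': {'scrapy.pipelines.images.ImagesPipeline': 1},
    'IMAGES_STORE': 'Images',  # relative to the directory the script is run from
})
c.crawl(YifyTorrentSpider)
c.start()

With these settings the downloaded files end up under Images/full/, with filenames derived from a SHA1 hash of each image URL.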
Upvotes: 3
Reputation: 898
What strikes me first when scanning the code above is the path in IMAGES_STORE. The leading / means you are pointing at the absolute root of your machine, so either give the full absolute path to where you want the images saved, or use a path relative to where you run your crawler.
I'm on a Linux machine, so my absolute path would be something like IMAGES_STORE = '/home/pk/myProjects/scraper/images'
OR
IMAGES_STORE = 'images'
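On Windows, going by the folder layout described in the question, the absolute form would presumably look something like the following (the path is taken from the question and is only an illustration; a raw string keeps the backslashes from being treated as escapes):

IMAGES_STORE = r'C:\Users\WCS\Desktop\torrentspider\torrentspider\spiders\Images'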
Also, and most importantly, if you are using the default pipeline, the field which holds the extracted image URLs (where you do extract_first()) must literally be called image_urls.
You are also missing a couple of steps. In your spider, add this (Item needs to be imported from scrapy):

from scrapy import Item

class ImgData(Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
In the yield step, modify to the following (image_urls is wrapped in a list because the default pipeline iterates over it as a collection of URLs):

yield ImgData(image_urls=[response.urljoin(q.css("::attr(src)").extract_first())])
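Put together, the parse method from the question would then read roughly as below (same selectors as in the question, one item per poster; this is only a sketch of how the pieces combine):

def parse(self, response):
    for q in response.css("article.img-item .poster-thumb"):
        url = response.urljoin(q.css("::attr(src)").extract_first())
        # The pipeline downloads each URL in image_urls and records the
        # results (path, checksum, original url) in the images field.
        yield ImgData(image_urls=[url])

The files themselves are written under IMAGES_STORE in a full/ subdirectory, named after a SHA1 hash of each URL.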
Upvotes: 0