Reputation: 1
I'm trying to scrape a few images from a website. Sorry in advance, I am not very experimented with Python and it is the first time I try using scrapy.
I manage apparently to get all the images I need, but they somehow get lost and my output folder remains empty.
I looked at a few tutorials and all the similar questions I could find on SO, but nothing seemed to really work out.
My spider:
from testspider.items import TestspiderItem
import datetime
import scrapy
class PageSpider(scrapy.Spider):
name = 'page-spider'
start_urls = ['http://scan-vf.co/one_piece/chapitre-807/1']
def parse(self, response):
SET_SELECTOR = '.img-responsive'
page = 1
for imgPage in response.css(SET_SELECTOR):
IMAGE_SELECTOR = 'img ::attr(src)'
imgURL = imgPage.css(IMAGE_SELECTOR).extract_first()
title = 'op-807-' + str(page)
page += 1
yield TestspiderItem({'title':title, 'image_urls':[imgURL]})
My items:
import scrapy
class TestspiderItem(scrapy.Item):
title = scrapy.Field()
image_urls = scrapy.Field()
images = scrapy.Field()
My settings:
BOT_NAME = 'testspider'
SPIDER_MODULES = ['testspider.spiders']
NEWSPIDER_MODULE = 'testspider.spiders'
DEFAULT_ITEM_CLASS = 'testspider.items'
ROBOTSTXT_OBEY = True
ITEM_PIPELINES = {
'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGE_STORE = '/home/*******/documents/testspider/output'
If you could be so kind as to help me understanding what's missing / what's incorrect, I would be grateful
Upvotes: 0
Views: 145
Reputation: 10666
If you check a source code (usually Ctrl+U
in a browser) you'll find that each img
is a something like this:
<img class="img-responsive" src="" data-src=' https://www.scan-vf.co/uploads/manga/one_piece/chapters/chapitre-807/01.jpg ' alt='One Piece: Chapter chapitre-807 - Page 1'/>
As you can see you need to use data-src
in your code instead of src
:
IMAGE_SELECTOR = 'img ::attr(data-src)'
Upvotes: 1