user3156769
user3156769

Reputation: 11

scrapy doesn't download images

I am trying to download images from different urls via scrapy. I'm new to python and scrapy so maybe I'm missing something obvious. This is my first post on stack overflow. Help would be really appreciated!

Here are my different files :

items.py

from scrapy.item import Item, Field

class ImagesTestItem(Item):
    image_urls = Field()
    image_names =Field()
    images = Field()
    pass

setting.py:

from scrapy import log

log.msg("This is a warning", level=log.WARNING)
log.msg("This is a error", level=log.ERROR)

BOT_NAME = 'images_test'

SPIDER_MODULES = ['images_test.spiders']
NEWSPIDER_MODULE = 'images_test.spiders'
ITEM_PIPELINES = {'images_test.pipelines.images_test': 1}
IMAGES_STORE = '/Users/Coralie/Documents/scrapy/images_test/images'
DOWNLOAD_DELAY = 5
STATS_CLASS = True

spider:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.loader import XPathItemLoader
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item,Field
from scrapy.utils.response import get_base_url
import logging
from scrapy.log import ScrapyFileLogObserver

logfile = open('testlog.log', 'w')
log_observer = ScrapyFileLogObserver(logfile, level=logging.DEBUG)
log_observer.start()

class images_test(CrawlSpider):
    name = "images_test"
    allowed_domains = ['veranstaltungszentrum.bbaw.de']
    start_urls = ['http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib0%d_g.jpg' % i for i in xrange(9)  ]

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        sites = hxs.select()
        number = 0
        for site in sites:    
            xpath = '//img/@src'
            image_urls = hxs.select('//img/@src').extract()
            item['image_urls'] = ["http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib0x_g.jpg" + x for x in image_urls]
            items.append(item)
            number = number + 1
            return item

        print item['image_urls']

pipelines.py

from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request
from PIL import Image 
from scrapy import log

log.msg("This is a warning", level=log.WARNING)
log.msg("This is a error", level=log.ERROR)
scrapy.log.ERROR

class images_test(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item

the log is saying the following:

/Library/Python/2.7/site-packages/Scrapy-0.20.2-py2.7.egg/scrapy/settings/deprecated.py:26: ScrapyDeprecationWarning: You are using the following settings which are deprecated or obsolete (ask [email protected] for alternatives):
    STATS_ENABLED: no longer supported (change STATS_CLASS instead)
  warnings.warn(msg, ScrapyDeprecationWarning)
2014-01-03 11:36:48+0100 [scrapy] INFO: Scrapy 0.20.2 started (bot: images_test)
2014-01-03 11:36:48+0100 [scrapy] DEBUG: Optional features available: ssl, http11
2014-01-03 11:36:48+0100 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'images_test.spiders', 'SPIDER_MODULES': ['images_test.spiders'], 'DOWNLOAD_DELAY': 5, 'BOT_NAME': 'images_test'}
2014-01-03 11:36:48+0100 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-01-03 11:36:49+0100 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-01-03 11:36:49+0100 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-01-03 11:36:49+0100 [scrapy] WARNING: This is a warning
2014-01-03 11:36:49+0100 [scrapy] ERROR: This is a error
2014-01-03 11:36:49+0100 [scrapy] DEBUG: Enabled item pipelines: images_test
2014-01-03 11:36:49+0100 [images_test] INFO: Spider opened
2014-01-03 11:36:49+0100 [images_test] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-03 11:36:49+0100 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-01-03 11:36:49+0100 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-01-03 11:36:49+0100 [images_test] DEBUG: Crawled (404) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib00_g.jpg> (referer: None)
2014-01-03 11:36:55+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib01_g.jpg> (referer: None)
2014-01-03 11:36:59+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib02_g.jpg> (referer: None)
2014-01-03 11:37:05+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib03_g.jpg> (referer: None)
2014-01-03 11:37:10+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib04_g.jpg> (referer: None)
2014-01-03 11:37:16+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib05_g.jpg> (referer: None)
2014-01-03 11:37:22+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib06_g.jpg> (referer: None)
2014-01-03 11:37:29+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib07_g.jpg> (referer: None)
2014-01-03 11:37:36+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib08_g.jpg> (referer: None)
2014-01-03 11:37:36+0100 [images_test] INFO: Closing spider (finished)
2014-01-03 11:37:36+0100 [images_test] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 2376,
     'downloader/request_count': 9,
     'downloader/request_method_count/GET': 9,
     'downloader/response_bytes': 343660,
     'downloader/response_count': 9,
     'downloader/response_status_count/200': 8,
     'downloader/response_status_count/404': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 1, 3, 10, 37, 36, 166139),
     'log_count/DEBUG': 15,
     'log_count/ERROR': 1,
     'log_count/INFO': 3,
     'log_count/WARNING': 1,
     'response_received_count': 9,
     'scheduler/dequeued': 9,
     'scheduler/dequeued/memory': 9,
     'scheduler/enqueued': 9,
     'scheduler/enqueued/memory': 9,
     'start_time': datetime.datetime(2014, 1, 3, 10, 36, 49, 37947)}
2014-01-03 11:37:36+0100 [images_test] INFO: Spider closed (finished)

How come images are not getting saved? Even my print item['image_urls'] command is not being executed.

Thank you

Upvotes: 1

Views: 1501

Answers (2)

Shahzaib Chadhar
Shahzaib Chadhar

Reputation: 327

You can try out without piplines:

def parse(self,response):
   #extract your images url
   imageurl = response.xpath("//img/@src").get()
   imagename = imageurl.split("/")[-1]
   req = urllib.request.Request(imageurl, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.100 Safari/537.36'})
   resource = urllib.request.urlopen(req)
   output = open("foldername/"+imagename,"wb")
   output.write(resource.read())
   output.close()

Upvotes: 0

Guy Gavriely
Guy Gavriely

Reputation: 11396

consider changing your spider code to the following:

start_urls = ['http://veranstaltungszentrum.bbaw.de/en/photo_gallery']

def parse(self, response):
        sel = HtmlXPathSelector(response)
        item = ImagesTestItem()
        url = 'http://veranstaltungszentrum.bbaw.de'
        return item['image_urls'] = [urljoin(url, x) for x in 
                                             sel.select('//img/@src').extract())]

HtmlXPathSelector can only parse html documents, it seem that you fed it with images from your start_urls

Upvotes: 2

Related Questions