Reputation: 268
I have been trying to use Scrapy to crawl Imgur for images, but I'm running into problems.
The spider seems to start up fine, but it never actually visits the site or scrapes anything.
I can't find where I went wrong.
items.py
import scrapy

class ImgurItem(scrapy.Item):
    title = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
settings.py
BOT_NAME = 'imgur'
SPIDER_MODULES = ['imgur.spiders']
NEWSPIDER_MODULE = 'imgur.spiders'
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = '\Users\123\Desktop\images'
imgur_spider.py
import scrapy
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor
from imgur.items import ImgurItem

class ImgurSpider(CrawlSpider):
    name = 'imgur'
    allowed_domains = ['imgur.com']
    start_url = ['http://imgur.com']
    rules = [Rule(LinkExtractor(allow=['/gallery/.*']), 'parse_imgur')]

    def parse_imgur(self, response):
        image = ImgurItem()
        image['title'] = response.xpath("//h2[@id='image-title']/text()").extract()
        rel = response.xpath("//img/@src").extract()
        image['image_urls'] = ['http:' + rel[0]]
        return image
Log
2015-10-16 16:36:50 [scrapy] INFO: Scrapy 1.0.3 started (bot: imgur)
2015-10-16 16:36:50 [scrapy] INFO: Optional features available: ssl, http11
2015-10-16 16:36:50 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'imgur.spiders', 'SPIDER_MODULES': ['imgur.spiders'], 'BOT_NAME': 'imgur'}
2015-10-16 16:36:50 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-10-16 16:36:50 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-10-16 16:36:50 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-10-16 16:36:50 [scrapy] INFO: Enabled item pipelines: ImagesPipeline
2015-10-16 16:36:50 [scrapy] INFO: Spider opened
2015-10-16 16:36:50 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-10-16 16:36:50 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-10-16 16:36:50 [scrapy] INFO: Closing spider (finished)
2015-10-16 16:36:50 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 10, 16, 15, 36, 50, 469000),
'log_count/DEBUG': 1,
'log_count/INFO': 7,
'start_time': datetime.datetime(2015, 10, 16, 15, 36, 50, 462000)}
2015-10-16 16:36:50 [scrapy] INFO: Spider closed (finished)
Upvotes: 1
Views: 298
Reputation: 4491
Your crawler is going to the start page (http://imgur.com), then looking for all links containing "/gallery/" (the extra .* isn't needed). It isn't finding any (because there are none on the page), so the crawler finishes.
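For illustration (not part of the original answer), a minimal sketch of that rule without the trailing .* could look like this:

rules = [
    # '/gallery/' alone is enough: LinkExtractor's allow patterns are
    # regular expressions matched anywhere in the URL, so the trailing
    # .* adds nothing.
    Rule(LinkExtractor(allow=['/gallery/']), callback='parse_imgur'),
]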
EDIT:
After cleaning up the code, it jumped out that start_url = ... should be start_urls = .... As is, Scrapy is grabbing nothing because it isn't being provided with anywhere to start.
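As a quick sketch of that fix (reusing the names from the question), the spider's class attributes would become:

class ImgurSpider(CrawlSpider):
    name = 'imgur'
    allowed_domains = ['imgur.com']
    # start_urls (plural) is the attribute Scrapy actually reads;
    # with start_url the spider has no requests to schedule and
    # closes immediately, which matches the log above.
    start_urls = ['http://imgur.com']
    rules = [Rule(LinkExtractor(allow=['/gallery/']), callback='parse_imgur')]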
Upvotes: 1