user1592380

Reputation: 36247

Setting rules with Scrapy CrawlSpider

I'm trying out the Scrapy CrawlSpider subclass for the first time. I've created the following spider, based closely on the docs example at https://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider-example:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class Test_Spider(CrawlSpider):

    name = "test"

    allowed_domains = ['www.dragonflieswellness.com']
    start_urls = ['http://www.dragonflieswellness.com/wp-content/uploads/2015/09/']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        # Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching '.jpg' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow='.jpg'), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        print(response.url)

I'm trying to get the spider to start at the prescribed directory and then extract all the '.jpg' links in that directory, but I see:

2016-09-29 13:07:35 [scrapy] INFO: Spider opened
2016-09-29 13:07:35 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-09-29 13:07:35 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-09-29 13:07:36 [scrapy] DEBUG: Crawled (200) <GET http://www.dragonflieswellness.com/wp-content/uploads/2015/09/> (referer: None)
2016-09-29 13:07:36 [scrapy] INFO: Closing spider (finished)

How can I get this working?

Upvotes: 0

Views: 1948

Answers (1)

mihal277

Reputation: 126

First of all, the purpose of rules is not only to extract links but, above all, to follow them. If you just want to extract links (and, say, save them for later), you don't have to specify spider rules. If, on the other hand, you want to download the images, use a pipeline.

That said, the reason the spider does not follow the links is hidden in the implementation of LinkExtractor:

# common file extensions that are not followed if they occur in links
IGNORED_EXTENSIONS = [
    # images
    'mng', 'pct', 'bmp', 'gif', 'jpg', 'jpeg', 'png', 'pst', 'psp', 'tif',
    'tiff', 'ai', 'drw', 'dxf', 'eps', 'ps', 'svg',

    # audio
    'mp3', 'wma', 'ogg', 'wav', 'ra', 'aac', 'mid', 'au', 'aiff',

    # video
    '3gp', 'asf', 'asx', 'avi', 'mov', 'mp4', 'mpg', 'qt', 'rm', 'swf', 'wmv',
    'm4a',

    # office suites
    'xls', 'xlsx', 'ppt', 'pptx', 'pps', 'doc', 'docx', 'odt', 'ods', 'odg',
    'odp',

    # other
    'css', 'pdf', 'exe', 'bin', 'rss', 'zip', 'rar',
]
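
If you do want to keep the rule-based approach, LinkExtractor accepts a deny_extensions argument that overrides this list. A minimal sketch (the spider name and the tightened allow pattern are my own):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class JpgLinkSpider(CrawlSpider):
    name = 'jpg_links'
    allowed_domains = ['www.dragonflieswellness.com']
    start_urls = ['http://www.dragonflieswellness.com/wp-content/uploads/2015/09/']

    rules = (
        # An empty deny_extensions list disables the IGNORED_EXTENSIONS
        # filter, so .jpg links are extracted and requested.
        Rule(LinkExtractor(allow=r'\.jpg$', deny_extensions=[]),
             callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)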

EDIT:

In order to download images using ImagesPipeline in this example:

Add this to settings:

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}

IMAGES_STORE = '/home/user/some_directory' # use a correct path

Create a new item:

from scrapy import Item, Field


class MyImageItem(Item):
    images = Field()
    image_urls = Field()

Modify your spider by adding a parse method (note that CrawlSpider uses parse internally to drive its rules, so overriding it means the rules no longer fire; a plain scrapy.Spider is sufficient here):

    # needs: from scrapy.loader import ItemLoader
    def parse(self, response):
        loader = ItemLoader(item=MyImageItem(), response=response)
        img_paths = response.xpath('//a[substring(@href, string-length(@href)-3)=".jpg"]/@href').extract()
        loader.add_value('image_urls', [self.start_urls[0] + img_path for img_path in img_paths])
        return loader.load_item()

The XPath selects every href that ends with ".jpg", and extract() turns the matches into a list of strings.

A loader is an additional feature that simplifies creating objects, but you could do without it.
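
For example, a minimal sketch of the same method without a loader (building the item directly):

    def parse(self, response):
        img_paths = response.xpath(
            '//a[substring(@href, string-length(@href)-3)=".jpg"]/@href'
        ).extract()
        # Build the item directly instead of going through an ItemLoader
        return MyImageItem(
            image_urls=[self.start_urls[0] + p for p in img_paths])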

Note that I'm no expert and there might be a better, more elegant solution. This one, however, works fine.

Upvotes: 1
