PolarisRouge

Reputation: 435

Scrapy limit on pages crawled is not working

I have a simple crawler that crawls all links on a website. I need to limit the crawl based on a command line argument (e.g. boundary=3). My issue is that I cannot get CLOSESPIDER_ITEMCOUNT to work. In settings.py I added EXTENSIONS = {'scrapy.extensions.closespider.CloseSpider': 1}, but the spider still crawls every link on my simple test site instead of stopping after one item.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import logging
import os


class FollowAllSpider(CrawlSpider):
    # Note: the Scrapy setting name is CONCURRENT_REQUESTS (plural)
    custom_settings = {"CLOSESPIDER_ITEMCOUNT": 1, "CONCURRENT_REQUESTS": 1}

    name = 'follow_all'
    allowed_domains = ['testdomain.com']

    start_urls = ['https://www.testdomain.com/simple-website/']
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        # Save each crawled page as an .html file in the "pages/" directory next to this file
        dirname = os.path.dirname(__file__)
        filename = response.url.split("/")[-1] + '.html'
        file_path = os.path.join(dirname, "pages", filename)
        with open(file_path, 'wb') as f:
            f.write(response.body)

Upvotes: 1

Views: 730

Answers (1)

renatodvc

Reputation: 2564

If you want to limit the number of pages crawled, you should use CLOSESPIDER_PAGECOUNT, not CLOSESPIDER_ITEMCOUNT.
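
The limit doesn't have to be hard-coded either: Scrapy's -s flag overrides any setting for a single run, so the boundary=3 requirement from the question can be handled on the command line. A sketch, assuming the spider name follow_all from the question:

scrapy crawl follow_all -s CLOSESPIDER_PAGECOUNT=3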

Also worth noting that your spider doesn't yield any items, so if you were to use CLOSESPIDER_ITEMCOUNT there would be no items to count, since you are writing directly to a file.
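
For CLOSESPIDER_ITEMCOUNT to have something to count, parse_item would need to yield items rather than only writing files. A minimal sketch (the field names here are hypothetical):

def parse_item(self, response):
    # Each yielded dict is an item, which CLOSESPIDER_ITEMCOUNT can count
    yield {'url': response.url, 'body': response.body}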

You can read more on CLOSESPIDER_PAGECOUNT and CLOSESPIDER_ITEMCOUNT in the Scrapy extensions documentation: https://docs.scrapy.org/en/latest/topics/extensions.html

One last thing: when using CLOSESPIDER_PAGECOUNT there is a caveat you should be aware of, as your results may not match your expectations: https://stackoverflow.com/a/34535390/11326319
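
In short, CLOSESPIDER_PAGECOUNT is a shutdown trigger rather than a hard cap: requests already in flight when the limit is reached are still processed, so with the default concurrency you may download a few more pages than the limit. A sketch of making the count exact by serializing requests (trading speed for precision):

custom_settings = {
    "CLOSESPIDER_PAGECOUNT": 3,   # ask the spider to close after 3 pages
    "CONCURRENT_REQUESTS": 1,     # one request at a time makes the cutoff exact
}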

Upvotes: 1
