Reputation: 435
I have a simple crawler which crawls all links on a website. I need to limit it based on a command line argument (e.g. boundary=3). My issue is that I cannot get CLOSESPIDER_ITEMCOUNT working. In settings.py I added EXTENSIONS = {'scrapy.extensions.closespider.CloseSpider': 1}, but the spider still crawls all links on my simple website instead of stopping after 1 item.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import logging
import os

class FollowAllSpider(CrawlSpider):
    custom_settings = {"CLOSESPIDER_ITEMCOUNT": 1, "CONCURRENT_REQUEST": 1}
    name = 'follow_all'
    allowed_domains = ['testdomain.com']
    start_urls = ['https://www.testdomain.com/simple-website/']
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        dirname = os.path.dirname(__file__)
        filename = response.url.split("/")[-1] + '.html'
        filePath = os.path.join(dirname, "pages/", filename)
        with open(filePath, 'wb') as f:
            f.write(response.body)
        return
Upvotes: 1
Views: 730
Reputation: 2564
If you want to limit the number of pages crawled, you should use CLOSESPIDER_PAGECOUNT, not CLOSESPIDER_ITEMCOUNT.
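For example, a minimal sketch of the same spider with the page-based limit, assuming the rest of your class stays unchanged (note the setting name is CONCURRENT_REQUESTS, plural):

    class FollowAllSpider(CrawlSpider):
        name = 'follow_all'
        allowed_domains = ['testdomain.com']
        start_urls = ['https://www.testdomain.com/simple-website/']
        rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]
        # close the spider once this many responses have been crawled
        custom_settings = {"CLOSESPIDER_PAGECOUNT": 1, "CONCURRENT_REQUESTS": 1}

And since you want the boundary to come from the command line, you can also override the setting per run with -s instead of hard-coding it, e.g.:

    scrapy crawl follow_all -s CLOSESPIDER_PAGECOUNT=3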
Also worth noting that your spider doesn't yield any items, so if you were to use CLOSESPIDER_ITEMCOUNT there would be nothing to count, since you are writing directly to a file instead.
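If you do want the item-based limit to work, a minimal sketch of parse_item, assuming you keep writing the file and simply yield a small dict afterwards so the extension has an item to count:

    def parse_item(self, response):
        dirname = os.path.dirname(__file__)
        filename = response.url.split("/")[-1] + '.html'
        filePath = os.path.join(dirname, "pages/", filename)
        with open(filePath, 'wb') as f:
            f.write(response.body)
        # yielding an item is what CLOSESPIDER_ITEMCOUNT actually counts
        yield {"url": response.url, "file": filePath}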
You can read more on CLOSESPIDER_PAGECOUNT and CLOSESPIDER_ITEMCOUNT in the Scrapy documentation for the CloseSpider extension.
One last thing: when using CLOSESPIDER_PAGECOUNT there is a caveat you should be aware of, as your results may not match your expectations: https://stackoverflow.com/a/34535390/11326319
Upvotes: 1