Reputation: 43
Illustrative scenario: a Scrapy spider is built to scrape restaurant menus from a start_urls list of various restaurant websites. Once the menu has been found for a given restaurant, there is no need to keep crawling that restaurant's website; the spider should (ideally) abort the queue for that start_url and move on to the next restaurant.
Is there a way to stop Scrapy from crawling the remainder of its request queue *per start_url* once a stopping condition is satisfied? I don't think a CloseSpider exception is appropriate, since I don't want to stop the entire spider, only the queue of the current start_url, before moving on to the next start_url.
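For concreteness, here is a stripped-down sketch of the situation (domains and the "menu found" check are just placeholders): once the condition fires for a site, I would like the requests already queued for that same site to be dropped as well, not just this one callback to end.

import scrapy

class MenuSpider(scrapy.Spider):
    name = 'menus'
    start_urls = ['http://restaurant1.example', 'http://restaurant2.example']

    def parse(self, response):
        # Placeholder stopping condition: pretend any URL containing
        # "menu" is the page we were after for this restaurant.
        if 'menu' in response.url:
            yield {'site': response.url}
            # Returning here ends this callback, but requests already
            # scheduled for this site will still be crawled -- that is
            # the part I want to abort per start_url.
            return
        # Otherwise keep following links within the site.
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)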
Upvotes: 2
Views: 415
Reputation: 166
Don't use Scrapy rules. All you need is:
from scrapy import Request
from scrapy.selector import Selector

# Inside your Spider subclass:
start_urls = [
    'http://url1.com', 'http://url2.com', ...
]

def start_requests(self):
    # Issue one request per start URL, each handled by parse_url
    for url in self.start_urls:
        yield Request(url, self.parse_url)

def parse_url(self, response):
    hxs = Selector(response)
    item = YourItem()
    # process data
    return item
And don't forget to add all the domains to the allowed_domains list.
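If you have to follow a few links within each site before the menu turns up, the same idea extends: from your callback, yield follow-up requests only while the menu has not been found, so the per-site queue simply stops growing. A rough sketch along those lines (domains, selectors and the "menu" check are placeholders, untested):

import scrapy

class RestaurantMenuSpider(scrapy.Spider):
    name = 'restaurant_menus'
    allowed_domains = ['url1.com', 'url2.com']
    start_urls = ['http://url1.com', 'http://url2.com']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse_url)

    def parse_url(self, response):
        # Placeholder check: treat the first link containing "menu" as the target.
        menu_href = response.css('a[href*="menu"]::attr(href)').get()
        if menu_href:
            # Menu found: yield the item and no further requests,
            # so crawling of this restaurant's site ends here.
            yield {'site': response.url, 'menu_url': response.urljoin(menu_href)}
            return
        # Menu not found yet: keep following in-site links.
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse_url)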
Upvotes: 1