Reputation: 43
Illustrative scenario: a Scrapy spider is built to scrape restaurant menus from a start_urls list of various restaurant websites. Once the menu has been found for a given restaurant, there is no need to keep crawling that restaurant's website; the spider should (ideally) abort the queue for that start_url and move on to the next restaurant.
Is there a way to stop Scrapy from crawling the remainder of its request queue *per start_url* once a stopping condition is satisfied? I don't think a CloseSpider exception is appropriate, since I don't want to stop the entire spider, only the queue of the current start_url, before moving on to the next start_url.
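For concreteness, here is a stripped-down sketch of the situation (domains and the "menu found" check are just placeholders): once the condition fires for a site, I would like the requests already queued for that same site to be dropped as well, not just this one callback to end.

import scrapy

class MenuSpider(scrapy.Spider):
    name = 'menus'
    start_urls = ['http://restaurant1.example', 'http://restaurant2.example']

    def parse(self, response):
        # Placeholder stopping condition: pretend any URL containing
        # "menu" is the page we were after for this restaurant.
        if 'menu' in response.url:
            yield {'site': response.url}
            # Returning here ends this callback, but requests already
            # scheduled for this site will still be crawled -- that is
            # the part I want to abort per start_url.
            return
        # Otherwise keep following links within the site.
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)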
Upvotes: 2
Views: 415
Reputation: 166
Don't use Scrapy rules. All you need is:
from scrapy import Request
from scrapy.selector import Selector

# Inside your Spider subclass:
start_urls = [
    'http://url1.com', 'http://url2.com', ...
]

def start_requests(self):
    # Issue one request per start URL, each handled by parse_url
    for url in self.start_urls:
        yield Request(url, self.parse_url)

def parse_url(self, response):
    hxs = Selector(response)
    item = YourItem()
    # process data
    return item
And don't forget to add all the domains to the allowed_domains list.
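If you have to follow a few links within each site before the menu turns up, the same idea extends: from your callback, yield follow-up requests only while the menu has not been found, so the per-site queue simply stops growing. A rough sketch along those lines (domains, selectors and the "menu" check are placeholders, untested):

import scrapy

class RestaurantMenuSpider(scrapy.Spider):
    name = 'restaurant_menus'
    allowed_domains = ['url1.com', 'url2.com']
    start_urls = ['http://url1.com', 'http://url2.com']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse_url)

    def parse_url(self, response):
        # Placeholder check: treat the first link containing "menu" as the target.
        menu_href = response.css('a[href*="menu"]::attr(href)').get()
        if menu_href:
            # Menu found: yield the item and no further requests,
            # so crawling of this restaurant's site ends here.
            yield {'site': response.url, 'menu_url': response.urljoin(menu_href)}
            return
        # Menu not found yet: keep following in-site links.
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse_url)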
Upvotes: 1