gunesevitan

Reputation: 965

Scrapy - Stopping crawler when duplicate item encountered

There are lots of websites where I have to hard-code page following (incrementing the page number after crawling the items on each page), and some of those websites loop back to page 1 after the last page. For example, if a website has 25 pages of items, requesting the 26th page yields a response containing the first page.

At that point Scrapy's duplicate filter works fine and the items aren't scraped again, but the crawler keeps running. Is there any way to stop the crawl when the duplicate filter is triggered like this?

I don't want to hard-code the page number like this, since it can change over time:

if self.page < 25:
    yield scrapy.Request(...)

Upvotes: 1

Views: 309

Answers (1)

Gallaecio

Reputation: 3857

  1. Configure your request not to be filtered out by the duplicate filter (pass dont_filter=True to the request constructor)

  2. Use the request callback to stop the crawler (raise scrapy.exceptions.CloseSpider) when response.url is unexpectedly the URL of the first page (see the sketch below)
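
A minimal sketch of how the two steps could fit together, assuming a hypothetical site paginated as https://example.com/items?page=N (where the wrapped-around request redirects to page 1) and a hypothetical .item selector; the real URL scheme and selectors will differ:

import scrapy
from scrapy.exceptions import CloseSpider


class ItemsSpider(scrapy.Spider):
    name = "items"
    # Hypothetical paginated URL scheme; adjust for the real site.
    base_url = "https://example.com/items?page={}"
    start_urls = [base_url.format(1)]

    def parse(self, response):
        requested_page = response.meta.get("page", 1)

        # If we asked for a page beyond 1 but got the first page back,
        # the pagination has wrapped around: stop the spider.
        if requested_page > 1 and response.url == self.base_url.format(1):
            raise CloseSpider("pagination wrapped around to the first page")

        for item in response.css(".item"):  # hypothetical selector
            yield {"title": item.css("::text").get()}

        # dont_filter=True lets the wrapped-around request reach this
        # callback instead of being dropped by the duplicate filter.
        next_page = requested_page + 1
        yield scrapy.Request(
            self.base_url.format(next_page),
            callback=self.parse,
            meta={"page": next_page},
            dont_filter=True,
        )

This way the stopping condition comes from the site's own behaviour rather than a hard-coded last page number.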

Upvotes: 1
