Jim Factor

Reputation: 1535

Scrapy - How to avoid Pagination Blackhole?

I was recently working on a website spider and noticed it was requesting an infinite number of pages, because the site hadn't coded its pagination to ever stop.

So while they only had a few pages of content, it would still generate a next link and a URL: ...?page=400, ...?page=401, etc.

The content didn't change, just the URL. Is there a way to make Scrapy stop following pagination when the content stops changing, or is there something I could code up myself?
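
For illustration, this is a stripped-down version of the kind of follow-the-next-link loop that runs away like this; the start URL and the selector are placeholders, not the actual site:

import scrapy


class NaiveSpider(scrapy.Spider):
    name = 'naive'
    start_urls = ['http://example.com/catalog?page=1']  # placeholder URL

    def parse(self, response):
        # ... extract items here ...
        # the site renders a "next" link even past the real content,
        # so this loop never stops
        next_url = response.xpath('//a[@rel="next"]/@href').extract_first()  # placeholder selector
        if next_url:
            yield response.follow(next_url, callback=self.parse)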

Upvotes: 0

Views: 609

Answers (1)

Granitosaurus

Reputation: 21446

If the content doesn't change, you can compare the content of the current page with that of the previous page and, if they are the same, stop the crawl.

For example:

import logging
import re

from scrapy import Request


def parse(self, response):
    product_urls = response.xpath("//a/@href").extract()
    # stop if this page lists exactly the same urls as the previous page
    if response.meta.get('prev_urls') == product_urls:
        logging.info('reached the last page at: {}'.format(response.url))
        return  # reached the last page
    # crawl products
    for url in product_urls:
        yield Request(response.urljoin(url), self.parse_product)
    # build the next page url by bumping the page number
    next_page = response.meta.get('page', 0) + 1
    next_url = re.sub(r'page=\d+', 'page={}'.format(next_page), response.url)
    # carry the current page's urls and page number forward in meta
    yield Request(next_url,
                  meta={'prev_urls': product_urls,
                        'page': next_page})
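
For completeness, here is a sketch of how that parse method might be wired into a full spider. The start URL, the product-link selector, and the fields in parse_product are assumptions made up for the example, not taken from the question:

import logging
import re

import scrapy
from scrapy import Request


class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['http://example.com/catalog?page=1']  # placeholder URL

    def parse(self, response):
        # placeholder selector for the product links on a listing page
        product_urls = response.xpath('//a[@class="product"]/@href').extract()
        # same listing as the previous page -> we ran off the end of the content
        if response.meta.get('prev_urls') == product_urls:
            logging.info('reached the last page at: {}'.format(response.url))
            return
        for url in product_urls:
            yield Request(response.urljoin(url), callback=self.parse_product)
        # bump the page number in the current url to get the next page
        next_page = response.meta.get('page', 1) + 1
        next_url = re.sub(r'page=\d+', 'page={}'.format(next_page), response.url)
        yield Request(next_url,
                      callback=self.parse,
                      meta={'prev_urls': product_urls, 'page': next_page})

    def parse_product(self, response):
        # placeholder item fields
        yield {
            'url': response.url,
            'title': response.xpath('//h1/text()').extract_first(),
        }

As an extra safety net you can also set Scrapy's DEPTH_LIMIT setting, which caps how many pages deep the crawl can go even if the comparison ever misses.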

Upvotes: 1
