Antonio Oliveira

Reputation: 109

Incremental Pagination in Scrapy / Python

I've run into a pagination difficulty with Scrapy. I usually use the following code successfully:

next_page = response.xpath("//div//div[4]//ul[1]//li[10]//a[1]//@href").extract_first()
if next_page is not None:
    yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse)

This time, however, I came across a website whose pagination is displayed in blocks of 5 pages. See the image below.

[Image: pagination bar showing page links in blocks of 5]

So, after capturing the first 5 pages, Scrapy jumps to the penultimate page (526).

The pagination URLs follow this pattern:

https://www.example.com-1-data.html

The number increases incrementally. Can anyone help me with an incremental request (based on the example address) for this pagination?
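For reference, this is roughly the kind of incremental loop I have in mind (a rough sketch only; the last page number and the exact URL format are guesses based on the example address above):

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        # Sketch only: assumes the number before "-data.html" is the page index
        # and that the last page number is known in advance.
        last_page = 527  # hypothetical value taken from the site's pagination
        for page in range(1, last_page + 1):
            url = f'https://www.example.com-{page}-data.html'
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # item extraction would go here
        pass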

Upvotes: 3

Views: 2628

Answers (2)

Sergey

Reputation: 57

To extract data from all pagination pages, you can use Scrapy's LinkExtractor in the spider's rules.

1. Use a regular expression

rules = (
    Rule(LinkExtractor(allow='.*part-of-url/page-nav/page.*'), callback='parse_page', follow=True),
)

2. Use XPath

rules = (
    Rule(LinkExtractor(allow=(), restrict_xpaths='//ul[@class="nav-block"]'), callback='parse_page', follow=True),
)
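For context, here is a minimal sketch of how such a rule fits into a CrawlSpider (the spider name, start URL and allow pattern below are hypothetical placeholders):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class PagesSpider(CrawlSpider):
    name = 'pages'  # hypothetical name
    start_urls = ['https://www.example.com-1-data.html']

    rules = (
        # follow every pagination link matching the numeric URL pattern
        Rule(LinkExtractor(allow=r'.*-\d+-data\.html'),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # extract the items of each pagination page here
        yield {'url': response.url}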

Upvotes: 2

Granitosaurus

Reputation: 21406

When it comes to pagination, the optimal approach really depends on what sort of pagination is being used.

If you:

  • know the URL page format, e.g. that a url parameter such as page indicates which page you're on
  • know the total number of pages

Then you can schedule all pages at once:

from scrapy import Request
from w3lib.url import add_or_replace_parameter


def parse_listings_page1(self, response):
    """
    Parse the first page here and schedule all other pages at once!
    """
    # e.g. 'http://shop.com/products?page=1'
    url = response.url
    # e.g. 100
    total_pages = int(response.css('.last-page::text').extract_first())

    # schedule every page at once!
    for page in range(2, total_pages + 1):
        page_url = add_or_replace_parameter(url, 'page', page)
        yield Request(page_url, self.parse_listings)
    # don't forget to also parse listings on the first page!
    yield from self.parse_listings(response)


def parse_listings(self, response):
    for url in response.css('.listing::attr(href)').extract():
        yield Request(response.urljoin(url), self.parse_product)

The huge benefit of this approach is speed: here you can take advantage of Scrapy's async logic and crawl all pages simultaneously!
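(How many of those scheduled requests actually run in parallel is governed by Scrapy's concurrency settings; the values below are only illustrative, not recommendations.)

# settings.py (illustrative values)
CONCURRENT_REQUESTS = 32             # total requests in flight at once
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # per-domain cap
DOWNLOAD_DELAY = 0                   # no artificial delay between requests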

Alternatively.

If you:

  • don't know anything other than that the next page's URL is somewhere on the current page

Then you have to schedule the pages synchronously, one by one:

def parse(self, response):

    for product in response.css('.product::attr(href)').extract():
        yield Request(response.urljoin(product), self.parse_product)

    next_page = response.css('.next-page::attr(href)').extract_first()
    if next_page:
        yield Request(response.urljoin(next_page), self.parse)
    else:
        print(f'last page reached: {response.url}')

In your example you're using the second, synchronous approach, and your fears here are unfounded; you just have to ensure your XPath selector selects the right "next page" link.
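For example, instead of the positional li[10] selector, something like the sketch below (placed inside your parse method) selects the "next" link relative to the currently active page item; the class names are assumptions and have to be adapted to the site's real markup:

# assumes the currently highlighted page <li> carries an "active"-like class;
# the "next" link is then the <a> of the first following sibling
next_page = response.xpath(
    '//ul[contains(@class, "pagination")]'
    '/li[contains(@class, "active")]/following-sibling::li[1]/a/@href'
).extract_first()
if next_page is not None:
    yield scrapy.Request(response.urljoin(next_page), callback=self.parse)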

Upvotes: 7
