Reputation: 109
I ran into a pagination problem with Scrapy. I usually use the following code successfully:
next_page = response.xpath("//div//div[4]//ul[1]//li[10]//a[1]//@href").extract_first()
if next_page is not None:
    yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse)
It turns out that this time I came across a website whose navigation shows pages in blocks of 5. See the image below.
So, after capturing the first 5 pages, Scrapy jumps to the penultimate page (526).
The page URLs follow this pattern:
https://www.example.com-1-data.html
The number increases sequentially. Can anyone help me with the incremental request (based on the example address) for this pagination?
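To be clearer, I imagine something along these lines, just as a sketch: the domain and the 526 page count are placeholders from the example above, and it assumes the number in the URL simply counts up from 1.

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    last_page = 526  # placeholder: taken from the pagination shown on the site

    def start_requests(self):
        # build every page URL up front, since the number in the URL just increments
        for page in range(1, self.last_page + 1):
            url = f"https://www.example.com-{page}-data.html"
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # parse the items of each page here
        ...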
Upvotes: 3
Views: 2628
Reputation: 57
To extract data from all navigation pages, you can use Scrapy's LinkExtractor in CrawlSpider rules.
1. Use a RegExp in allow:
rules = (
    Rule(LinkExtractor(allow=r'.*part-of-url/page-nav/page.*'), callback='parse_page', follow=True),
)
2. Use an XPath in restrict_xpaths:
rules = (
    Rule(LinkExtractor(allow=(), restrict_xpaths='//ul[@class="nav-block"]'), callback='parse_page', follow=True),
)
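Either variant lives on a CrawlSpider. A minimal sketch of how the rules attribute fits into the spider; the start URL and the nav-block XPath are placeholders you would adapt to the real site:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class PageNavSpider(CrawlSpider):
    name = "page_nav"
    start_urls = ["https://www.example.com-1-data.html"]  # placeholder

    # follow every link found inside the pagination block and parse each page
    rules = (
        Rule(
            LinkExtractor(restrict_xpaths='//ul[@class="nav-block"]'),
            callback='parse_page',
            follow=True,
        ),
    )

    def parse_page(self, response):
        # extract the items of interest from each navigation page here
        ...

Note that with CrawlSpider you should not override parse(); the callback has to be a differently named method, hence parse_page.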
Upvotes: 2
Reputation: 21406
When it comes to pagination, the optimal approach really depends on what sort of pagination the site uses.
If the URL contains a page parameter (e.g. ?page=1) that indicates which page you're on, and the total page count can be read from the first page, then you can schedule all pages at once:
from scrapy import Request
from w3lib.url import add_or_replace_parameter  # ships with Scrapy

def parse_listings_page1(self, response):
    """
    Parse the first page, then schedule all other pages at once!
    """
    # e.g. 'http://shop.com/products?page=1'
    url = response.url
    # e.g. 100
    total_pages = int(response.css('.last-page::text').extract_first())
    # schedule every page at once!
    for page in range(2, total_pages + 1):
        page_url = add_or_replace_parameter(url, 'page', page)
        yield Request(page_url, self.parse_listings)
    # don't forget to also parse listings on the first page!
    yield from self.parse_listings(response)

def parse_listings(self, response):
    for url in response.css('.listing::attr(href)').extract():
        yield Request(url, self.parse_product)
The huge benefit of this approach is speed: you can take advantage of Scrapy's asynchronous scheduling and crawl all pages concurrently!
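For reference, add_or_replace_parameter comes from w3lib (installed alongside Scrapy) and simply rewrites a query-string parameter; a quick illustration with a made-up URL:

from w3lib.url import add_or_replace_parameter

# replaces (or adds) the given query-string parameter
add_or_replace_parameter('http://shop.com/products?page=1', 'page', 3)
# -> 'http://shop.com/products?page=3'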
Alternatively, if all you can find on each page is a "next page" link, then you have to schedule the pages synchronously, one by one:
def parse(self, response):
    for product in response.css('.product::attr(href)').extract():
        yield Request(product, self.parse_product)

    next_page = response.css('.next-page::attr(href)').extract_first()
    if next_page:
        yield Request(next_page, self.parse)
    else:
        print(f'last page reached: {response.url}')
In your example you're using the second, synchronous approach, and your worry here is unfounded: you just have to make sure your XPath selector picks out the right "next page" link.
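For instance, instead of selecting the link by position (li[10]), you could select the "next" link by something stable in the markup. A sketch, assuming the link carries rel="next" or a "Next" label; adjust the condition to the site's real HTML:

# assumption: the "next" link has rel="next" or a "Next" label in its text
next_page = response.xpath(
    '//ul/li/a[@rel="next" or contains(text(), "Next")]/@href'
).extract_first()
if next_page is not None:
    yield scrapy.Request(response.urljoin(next_page), callback=self.parse)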
Upvotes: 7