SIM

Reputation: 22440

Dealing with a paginated site using scrapy in a different manner

I've written a script in Python using Scrapy to parse some information from a webpage. The data on that page is spread across pagination. If I use response.follow(), I can get it done. However, I would like to follow within Scrapy the same logic I implemented with requests and BeautifulSoup, but I can't figure out how.

Using requests along with BeautifulSoup, I came up with this, which works just fine:

import requests
from bs4 import BeautifulSoup

page = 0
URL = 'http://esencjablog.pl/page/{}/'

while True:
    page += 1
    res = requests.get(URL.format(page))
    soup = BeautifulSoup(res.text, 'lxml')
    items = soup.select('.post_more a.qbutton')
    # stop once a page no longer yields post links
    if len(items) <= 1:
        break

    for a in items:
        print(a.get("href"))

I would like to do the same in Scrapy, following the logic I applied above, but every time I try I end up with something like this:

import scrapy


class PaginationTestSpider(scrapy.Spider):
    name = 'pagination'
    # 63 is used here because the highest page number is 62
    start_urls = ['http://esencjablog.pl/page/{}/'.format(page) for page in range(1, 63)]

    def parse(self, response):
        for link in response.css('.post_more a.qbutton'):
            yield {"link": link.css('::attr(href)').extract_first()}

Once again: my question is, if I want to do in Scrapy what I already did with requests and BeautifulSoup when the last page number is unknown, how should the spider be structured?

Upvotes: 0

Views: 785

Answers (3)

Tarun Lalwani

Reputation: 146510

In that case you can't take advantage of parallel downloads, but since you want to simulate the same thing in Scrapy, it can be achieved in different ways.

Approach 1 - Yield page using page numbers

import scrapy
from scrapy import Request


class PaginationTestSpider(scrapy.Spider):
    name = 'pagination'

    # Start with page #1
    start_urls = ['http://esencjablog.pl/page/1/']

    def parse(self, response):
        # We communicate the page number using request meta.
        # This is not mandatory, as we could extract the same data from
        # response.url as well, but I prefer using meta here.
        page_no = response.meta.get('page', 1) + 1

        items = response.css('.post_more a.qbutton')
        for link in items:
            yield {"link": link.css('::attr(href)').extract_first()}

        if items:
            # if items were found, we move on to the next page
            yield Request("http://esencjablog.pl/page/{}".format(page_no),
                          meta={"page": page_no}, callback=self.parse)

The ideal way would usually be: if you can determine the last page count from the first request, extract that number and fire all the requests at once in the first parse call. But that only works if it is possible to know the last page number.

Approach 2 - Yield next page using the next-page link

import scrapy


class PaginationTestSpider(scrapy.Spider):
    name = 'pagination'

    # Start with page #1
    start_urls = ['http://esencjablog.pl/page/1/']

    def parse(self, response):
        items = response.css('.post_more a.qbutton')
        for link in items:
            yield {"link": link.css('::attr(href)').extract_first()}

        next_page = response.xpath('//li[contains(@class, "next_last")]/a/@href').extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)  # follow the next page and parse again

This is essentially a copy of what @Konstantin mentioned. Sorry, but I wanted to make this a more complete answer.

Approach 3 - Yield all pages on the first response

import scrapy
from scrapy import Request


class PaginationTestSpider(scrapy.Spider):
    name = 'pagination'

    # Start with page #1
    start_urls = ['http://esencjablog.pl/page/1/']
    first_request = True

    def parse(self, response):
        if self.first_request:
            self.first_request = False
            # read the page number from the href of the "last page" arrow link
            last_page_num = int(response.css(".fa-angle-double-right::attr(href)").re_first(r"(\d+)/?$"))

            # yield all the pages on the first request so we take advantage of parallel downloads
            for page_no in range(2, last_page_num + 1):
                yield Request("http://esencjablog.pl/page/{}".format(page_no), callback=self.parse)

        items = response.css('.post_more a.qbutton')
        for link in items:
            yield {"link": link.css('::attr(href)').extract_first()}

The best part of this approach is that you fetch the first page, read the last page count from it, and yield all the remaining pages at once, so the downloads happen simultaneously. The first two approaches are more sequential in nature, and you would only follow them if you don't want to put much load on the site. The ideal approach for a scraper is Approach 3.
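As a side note not covered above, how many of those yielded requests actually run in parallel is governed by Scrapy's standard concurrency settings. Here is a minimal sketch showing how you could tune them via custom_settings; the values are illustrative, not taken from the original answer:

import scrapy


class PaginationTestSpider(scrapy.Spider):
    name = 'pagination'
    start_urls = ['http://esencjablog.pl/page/1/']

    # Standard Scrapy settings; adjust to control how aggressive the crawl is.
    custom_settings = {
        'CONCURRENT_REQUESTS': 16,            # overall parallel requests
        'CONCURRENT_REQUESTS_PER_DOMAIN': 8,  # parallel requests per domain
        'DOWNLOAD_DELAY': 0.25,               # small delay between requests, to be polite
    }

    def parse(self, response):
        for link in response.css('.post_more a.qbutton'):
            yield {"link": link.css('::attr(href)').extract_first()}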

Now, regarding the use of the meta object, it is well explained at the link below:

https://doc.scrapy.org/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions

Adding the same here for reference

Passing additional data to callback functions

The callback of a request is a function that will be called when the response of that request is downloaded. The callback function will be called with the downloaded Response object as its first argument.

Example:

def parse_page1(self, response):
    return scrapy.Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)

In some cases you may be interested in passing arguments to those callback functions so you can receive the arguments later, in the second callback. You can use the Request.meta attribute for that.

Here’s an example of how to pass an item using this mechanism, to populate different fields from different pages:

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    yield request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    yield item
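As an additional note, not part of the quoted documentation: newer Scrapy versions (1.7+) also provide Request.cb_kwargs, which passes data to the callback as explicit keyword arguments instead of through meta. A minimal sketch, with a hypothetical spider name and start URL chosen only for illustration:

import scrapy


class CallbackKwargsSpider(scrapy.Spider):
    # hypothetical spider, for illustration only
    name = 'cb_kwargs_demo'
    start_urls = ['http://www.example.com/page1.html']

    def parse(self, response):
        # entries in cb_kwargs are passed to the callback as keyword arguments
        yield scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2,
                             cb_kwargs={'main_url': response.url})

    def parse_page2(self, response, main_url):
        yield {'main_url': main_url, 'other_url': response.url}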

Upvotes: 4

Konstantin

Reputation: 547

You can iterate through the pages like this (see the Scrapy docs):

import scrapy


class PaginationTestSpider(scrapy.Spider):
    name = 'pagination'
    start_urls = ['http://esencjablog.pl/page/1/']  # go to the first page

    def parse(self, response):
        for link in response.css('.post_more a.qbutton'):
            yield {"link": link.css('::attr(href)').extract_first()}

        next_page = response.xpath('//li[contains(@class, "next_last")]/a/@href').extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)  # follow the next page and parse again

Upvotes: 0

VMRuiz

Reputation: 1981

You have to use scrapy.Request for that:

import scrapy


class PaginationTestSpider(scrapy.Spider):
    name = 'pagination'
    start_urls = ['http://esencjablog.pl/page/58']

    def parse(self, response):
        # Find the href of the next page link
        link = response.css('.post_more a.qbutton::attr(href)')
        if link:
            # Extract the href; we can use extract_first here because you only need one
            href = link.extract_first()
            # just in case the website uses relative hrefs
            url = response.urljoin(href)
            # You may change the callback if you want to use a different method
            yield scrapy.Request(url, callback=self.parse)

You can find more details in the Scrapy documentation.

Upvotes: 0
