omrcm

Reputation: 31

Scrapy parse pagination without next link

I'm trying to parse pagination that has no "next" link. The HTML is below:

<div id="pagination" class="pagination">
    <ul>
        <li>
            <a href="//www.demopage.com/category_product_seo_name" class="page-1 ">1</a>
        </li>
        <li>
            <a href="//www.demopage.com/category_product_seo_name?page=2" class="page-2 ">2</a>
        </li>
        <li>
            <a href="//www.demopage.com/category_product_seo_name?page=3" class="page-3 ">3</a>
        </li>
        <li>
            <a href="//www.demopage.com/category_product_seo_name?page=4" class="page-4 active">4</a>
        </li>
        <li>
            <a href="//www.demopage.com/category_product_seo_name?page=5" class="page-5">5</a>
        </li>
        <li>
            <a href="//www.demopage.com/category_product_seo_name?page=6" class="page-6 ">6</a>
        </li>
        <li>
                <span class="page-... three-dots">...</span>
        </li>
        <li>
           <a href="//www.demopage.com/category_product_seo_name?page=50" class="page-50 ">50</a>
        </li>
    </ul>   
</div>

For this HTML I have tried these XPath expressions:

response.xpath('//div[@class="pagination"]/ul/li/a/@href').extract()
or 
response.xpath('//div[@class="pagination"]/ul/li/a/@href/following-sibling::a[1]/@href').extract()

Is there a good way to parse this pagination? Thanks to all.

PS: I have checked these answers too:

Answer 1

Answer 2

Upvotes: 1

Views: 640

Answers (2)

Ikram Khan Niazi

Reputation: 807

You can simply get all the pagination links and follow them in a loop: each time you run the code below on a page, the selector returns the pagination links available on that page. You don't need to worry about duplicate URLs, as Scrapy filters duplicate requests for you by default. You could also use Scrapy Rules (with a CrawlSpider).

 response.css('.pagination ::attr(href)').getall()
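
A minimal sketch of that approach, assuming a spider whose parse callback re-yields every pagination link it finds (the spider name and start URL are placeholders):

import scrapy


class DemoSpider(scrapy.Spider):
    # Hypothetical name and start URL for illustration.
    name = 'demo'
    start_urls = ['https://www.demopage.com/category_product_seo_name']

    def parse(self, response):
        # ... extract items from the current page here ...

        # Follow every pagination link; Scrapy's built-in duplicate
        # filter drops URLs that were already requested.
        for href in response.css('.pagination ::attr(href)').getall():
            yield response.follow(href, callback=self.parse)

response.follow also resolves the protocol-relative hrefs from the question (//www.demopage.com/...) against the current page's scheme.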

Upvotes: 0

Felix Eklöf

Reputation: 3730

One solution is to scrape a fixed number of pages, but this isn't always a good fit if the total number of pages isn't constant:

import scrapy


class MySpider(scrapy.Spider):
    name = 'demo'  # spider name required by Scrapy
    num_pages = 10

    def start_requests(self):
        requests = []
        for i in range(1, self.num_pages + 1):
            requests.append(scrapy.Request(
                url='https://www.demopage.com/category_product_seo_name?page={0}'.format(i)
            ))
        return requests

    def parse(self, response):
        # Parse pages here.
        pass
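
Since all the page URLs are generated up front, Scrapy schedules these requests concurrently (subject to the CONCURRENT_REQUESTS setting) rather than discovering each page one at a time.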

Update

You can also keep track of the page count and do something like this. a[href*="?page=2"]::attr(href) will target a elements whose href attribute contains the specified substring (the CSS *= operator; ~= only matches whole space-separated words, so it wouldn't match here). (I'm not currently able to test if this code works, but something in the style of this should do it.)

import scrapy


class MySpider(scrapy.Spider):
    name = 'demo'  # spider name required by Scrapy
    start_urls = ['https://demopage.com/search?p=1']
    page_count = 1

    def parse(self, response):
        self.page_count += 1
        # ... parse the response here ...

        # Select the link whose href contains "?page=<next page number>".
        next_url = response.css(
            '#pagination > ul > li > a[href*="?page={0}"]::attr(href)'.format(self.page_count)
        ).get()
        if next_url:
            yield response.follow(next_url, callback=self.parse)
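
Note that the hrefs in the question are protocol-relative (//www.demopage.com/...), so resolving them with response.follow (or response.urljoin) matters; passing the raw selector result straight to scrapy.Request would raise a missing-scheme error.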

Upvotes: 2
