user4251615

How to get Scrapy to go to the next page?

I'm trying to build a scraper with Scrapy. I don't understand why Scrapy doesn't want to go to the next page. I thought I could extract the link from the pagination area... but, alas. Here is my rule for extracting the URLs that lead to the next page:

Rule(LinkExtractor(restrict_xpaths='/html/body/div[19]/div[5]/div[2]/div[5]/div/div[3]/ul',allow=('page=[0-9]*')), follow=True)

Crawler

class DmozSpider(CrawlSpider):
    name = "arbal"
    allowed_domains = ["bigbasket.com"]
    start_urls = [
        "http://bigbasket.com/pc/bread-dairy-eggs/bread-bakery/?nc=cs"
    ]

    rules = (
             Rule(LinkExtractor(restrict_xpaths='/html/body/div[19]/div[4]/ul',allow=('pc\/.*.\?nc=cs')), follow=True),
             Rule(LinkExtractor(restrict_xpaths='/html/body/div[19]/div[5]/div[2]/div[5]/div/div[3]/ul',allow=('page=[0-9]*')), follow=True),
             Rule(LinkExtractor(restrict_xpaths='//*[@id="products-container"]',allow=('pd\/*.+')), callback='parse_item', follow=True)
             )

    def parse_item(self, response):
        item = AlabaItem()
        hxs = HtmlXPathSelector(response)
        item['brand_name'] = hxs.select('.//*[contains(@id, "slidingProduct")]/div[2]/div[1]/a/text()').extract()
        item['prod_name'] =  hxs.select('//*[contains(@id, "slidingProduct")]/div[2]/div[2]/h1/text()').extract()
        yield item

Upvotes: 2

Views: 5175

Answers (1)

alecxe

Reputation: 473863

There is AJAX-style pagination here, which is not easy to follow, but it is doable.

Using the browser developer tools, you can see that every time you switch pages, an XHR request is sent to http://bigbasket.com/product/facet/get-page/ with sid and page parameters:

(screenshot: the XHR request as shown in the browser's network panel)
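For example, the request for the second page looks roughly like this (the sid value is made up):

http://bigbasket.com/product/facet/get-page/?sid=ABCD1234&page=2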

The tricky part is the sid parameter - we'll extract it from the first link on the page whose href contains sid.
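For instance, with a simplified pattern (the href and sid value below are made up; the spider further down uses a slightly stricter regex):

import re

href = '/pc/bread-bakery/?nc=cs&sid=ABCD1234'   # illustrative href; the real one comes from the page
sid = re.search(r'sid=(\w+)', href).group(1)    # 'ABCD1234'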

The response is JSON containing a products key, which is essentially the HTML of the products-container block on a page.
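So each response body is roughly of this shape (heavily truncated and purely illustrative; other keys are omitted):

{"products": "<div id=\"products-container\"> ... <li id=\"product...\"> ... </li> ... </div>"}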

Note that CrawlSpider would not help in this case. We need to use a regular spider and follow the pagination "manually".

Another question you may have is how we know how many pages to follow. The idea is to extract the total number of products from the "Showing X - Y of Z products" label at the bottom of the page, then divide that total by 20 (there are 20 products per page).
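As a rough worked example (the label text below is made up, but it has the same shape as the real one):

import re

label = "Showing 1 - 20 of 281 products"                            # hypothetical label text
total = int(re.search(r'of\s+(\d+)\s+products', label).group(1))    # 281
pages = total // 20                                                  # 14 full pages of 20 products, plus a partial last page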

Implementation (Python 2):

import json
import urllib

import scrapy


class DmozSpider(scrapy.Spider):
    name = "arbal"
    allowed_domains = ["bigbasket.com"]
    start_urls = [
        "http://bigbasket.com/pc/bread-dairy-eggs/bread-bakery/?nc=cs"
    ]

    def parse(self, response):
        # total number of products, taken from the "Showing X - Y of Z products" label
        total_products = int(response.xpath('//div[@class="noItems"]/span[@class="bFont"][last()]/text()').re(r'(\d+)')[0])

        # the sid parameter, taken from the first link on the page that carries it
        sid = response.xpath('//a[contains(@href, "sid")]/@href').re(r'sid=(\w+)(?!&|\Z)')[0]

        # follow the AJAX pagination: one request per page, 20 products per page
        base_url = 'http://bigbasket.com/product/facet/get-page/?'
        for page in range(1, total_products // 20 + 1):
            yield scrapy.Request(base_url + urllib.urlencode({'sid': sid, 'page': str(page)}),
                                 dont_filter=True, callback=self.parse_page)

    def parse_page(self, response):
        data = json.loads(response.body)

        # the "products" key holds the HTML of the products block for this page
        selector = scrapy.Selector(text=data['products'])
        for product in selector.xpath('//li[starts-with(@id, "product")]'):
            title = product.xpath('.//div[@class="muiv2-product-container"]//img/@title').extract()[0]
            print title

For the page listed in start_urls, it prints 281 product titles.
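To try it out, put the spider into a Scrapy project and run it by name, or run the file directly with runspider (the file name here is just an example):

scrapy crawl arbal

or, without a project:

scrapy runspider arbal_spider.py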

Upvotes: 2
