I'm trying to build a scraper with Scrapy. I don't understand why Scrapy doesn't want to go to the next page. I thought I'd extract the link from the pagination area... but, alas. My rule for extracting URLs for going to the next page:
Rule(LinkExtractor(restrict_xpaths='/html/body/div[19]/div[5]/div[2]/div[5]/div/div[3]/ul',allow=('page=[0-9]*')), follow=True)
The crawler:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector

# AlabaItem is assumed to be defined in the project's items module

class DmozSpider(CrawlSpider):
    name = "arbal"
    allowed_domains = ["bigbasket.com"]
    start_urls = [
        "http://bigbasket.com/pc/bread-dairy-eggs/bread-bakery/?nc=cs"
    ]

    rules = (
        # category links
        Rule(LinkExtractor(restrict_xpaths='/html/body/div[19]/div[4]/ul', allow=('pc\/.*.\?nc=cs')), follow=True),
        # pagination links
        Rule(LinkExtractor(restrict_xpaths='/html/body/div[19]/div[5]/div[2]/div[5]/div/div[3]/ul', allow=('page=[0-9]*')), follow=True),
        # product pages
        Rule(LinkExtractor(restrict_xpaths='//*[@id="products-container"]', allow=('pd\/*.+')), callback='parse_item', follow=True)
    )

    def parse_item(self, response):
        item = AlabaItem()
        hxs = HtmlXPathSelector(response)
        item['brand_name'] = hxs.select('.//*[contains(@id, "slidingProduct")]/div[2]/div[1]/a/text()').extract()
        item['prod_name'] = hxs.select('//*[contains(@id, "slidingProduct")]/div[2]/div[2]/h1/text()').extract()
        yield item
Upvotes: 2
Views: 5175
There is AJAX-style pagination here, which is not easy to follow, but it is doable. Using your browser's developer tools you can see that every time you switch pages, an XHR request is sent to http://bigbasket.com/product/facet/get-page/ with sid and page parameters.
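To make the request shape concrete, here is a minimal sketch of what such a paginated URL looks like (the sid value is invented for illustration; the real one has to be scraped from the page):

base_url = 'http://bigbasket.com/product/facet/get-page/?'
url = base_url + 'sid={0}&page={1}'.format('MTQzNzYw', 2)  # hypothetical sid value
print(url)  # http://bigbasket.com/product/facet/get-page/?sid=MTQzNzYw&page=2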
The tricky part is the sid parameter: it is what we'll extract from the first link on the page that contains sid.
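As a minimal standalone sketch of that extraction (the HTML snippet is invented; the spider below does the same thing against the real page):

import re
from scrapy.selector import Selector

html = '<a href="/product/facet/get-page/?sid=MTQzNzYw&page=2">2</a>'  # made-up markup
href = Selector(text=html).xpath('//a[contains(@href, "sid")]/@href').extract()[0]
sid = re.search(r'sid=(\w+)', href).group(1)
print(sid)  # MTQzNzYw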
The response is in JSON format and contains a products key, which is basically the HTML code of the products_container block on a page.
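For illustration, a sketch of that payload shape with an invented JSON body:

import json
from scrapy.selector import Selector

body = '{"products": "<ul><li id=\\"product1\\">...</li><li id=\\"product2\\">...</li></ul>"}'
data = json.loads(body)
selector = Selector(text=data['products'])
print(selector.xpath('//li/@id').extract())  # ['product1', 'product2'] (unicode strings on Python 2)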
Note that CrawlSpider would not help in this case. We need to use a regular spider and follow the pagination "manually".
Another question you may have: how would we know how many pages to follow? The idea here is to extract the total number of products from the "Showing X - Y of Z products" label at the bottom of the page, then divide that total by 20 (20 products per page).
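A quick sketch of that arithmetic, assuming 20 products per page and an invented label text (rounding up, so a partial last page is also counted):

import re

label = 'Showing 1 - 20 of 281 products'  # invented example of the label
total = int(re.findall(r'\d+', label)[-1])  # 281
num_pages = (total + 19) // 20  # ceiling division: 15 pages
print(num_pages)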
Implementation:
import json
import urllib

import scrapy


class DmozSpider(scrapy.Spider):
    name = "arbal"
    allowed_domains = ["bigbasket.com"]
    start_urls = [
        "http://bigbasket.com/pc/bread-dairy-eggs/bread-bakery/?nc=cs"
    ]

    def parse(self, response):
        # total number of products, read from the "Showing X - Y of Z products" label
        num_pages = int(response.xpath('//div[@class="noItems"]/span[@class="bFont"][last()]/text()').re(r'(\d+)')[0])

        # the sid parameter, read from the first link whose href contains it
        sid = response.xpath('//a[contains(@href, "sid")]/@href').re(r'sid=(\w+)(?!&|\z)')[0]

        # follow the pagination "manually" by issuing the XHR requests ourselves
        base_url = 'http://bigbasket.com/product/facet/get-page/?'
        for page in range(1, num_pages / 20 + 1):
            yield scrapy.Request(base_url + urllib.urlencode({'sid': sid, 'page': str(page)}),
                                 dont_filter=True,
                                 callback=self.parse_page)

    def parse_page(self, response):
        # the "products" key holds the HTML of the products container block
        data = json.loads(response.body)
        selector = scrapy.Selector(text=data['products'])
        for product in selector.xpath('//li[starts-with(@id, "product")]'):
            title = product.xpath('.//div[@class="muiv2-product-container"]//img/@title').extract()[0]
            print title  # Python 2 print statement, matching urllib.urlencode above
For the page set in start_urls it prints 281 product titles.
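Since the spider above is self-contained, it can be run without a full Scrapy project via scrapy runspider (the filename is hypothetical):

scrapy runspider bigbasket_spider.py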
Upvotes: 2