Reputation: 657
So far I have scraped data from one page. I want to continue until the end of the pagination.
Click Here to view the page
The problem seems to be that the href of the Next link is a JavaScript call rather than a URL:
<a href="javascript:void(0)" class="next" data-role="next" data-spm-anchor-id="a2700.galleryofferlist.pagination.8">Next</a>
# -*- coding: utf-8 -*-
import scrapy


class AlibabaSpider(scrapy.Spider):
    name = 'alibaba'
    allowed_domains = ['alibaba.com']
    start_urls = ['https://www.alibaba.com/catalog/agricultural-growing-media_cid144?page=1']

    def parse(self, response):
        for products in response.xpath('//div[contains(@class, "m-gallery-product-item-wrap")]'):
            item = {
                'product_name': products.xpath('.//h2/a/@title').extract_first(),
                'price': products.xpath('.//div[@class="price"]/b/text()').extract_first('').strip(),
                'min_order': products.xpath('.//div[@class="min-order"]/b/text()').extract_first(),
                'company_name': products.xpath('.//div[@class="stitle util-ellipsis"]/a/@title').extract_first(),
                'prod_detail_link': products.xpath('.//div[@class="item-img-inner"]/a/@href').extract_first(),
                'response_rate': products.xpath('.//i[@class="ui2-icon ui2-icon-skip"]/text()').extract_first('').strip(),
                #'image_url': products.xpath('.//div[@class=""]/').extract_first(),
            }
            yield item

        # Follow the pagination link
        next_page_url = response.xpath('//link[@rel="next"]/@href').extract_first()
        if next_page_url:
            yield scrapy.Request(url=next_page_url, callback=self.parse)
Upvotes: 4
Views: 986
Reputation: 98861
To find and parse all pages in a category, you can use something like:
import re
import requests

base_url = "https://www.alibaba.com/catalog/agricultural-growing-media_cid144?page="
resp = requests.get(base_url)
try:
    # The total page count is embedded in the page source as JSON
    n_pages = re.findall(r'"pagination":\{\s+"total":(.*?),', resp.text, re.IGNORECASE)
    if n_pages:
        for page in range(1, int(n_pages[0]) + 1):
            url = "{}{}".format(base_url, page)
            # Do the parsing in this block using the dynamically generated URLs:
            # https://www.alibaba.com/catalog/agricultural-growing-media_cid144?page=1
            # ...
            # https://www.alibaba.com/catalog/agricultural-growing-media_cid144?page=68
except Exception as e:
    print("Cannot find/parse the total number of pages", e)
    # Catch-all except; the error handling needs improvement
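The page-count extraction can be separated from the network call so it is easier to check on its own. Below is a minimal sketch of that idea; the helper names (extract_total_pages, build_page_urls) and the sample_html string are made up for illustration, and the regex is a slightly stricter variant of the one above (it matches digits only and tolerates missing whitespace), not something confirmed against the live page.

```python
import re

# Stricter variant of the regex above: digits only, optional whitespace.
PAGINATION_RE = re.compile(r'"pagination":\{\s*"total":(\d+)', re.IGNORECASE)


def extract_total_pages(html):
    """Return the total page count found in html, or None if it is absent."""
    match = PAGINATION_RE.search(html)
    return int(match.group(1)) if match else None


def build_page_urls(base_url, total_pages):
    """Build one URL per page by appending the page number to base_url."""
    return ["{}{}".format(base_url, page) for page in range(1, total_pages + 1)]


# Made-up snippet standing in for the real page source (an assumption).
sample_html = 'window.data = {"pagination":{ "total":3,"current":1}};'

base_url = "https://www.alibaba.com/catalog/agricultural-growing-media_cid144?page="
total = extract_total_pages(sample_html)
urls = build_page_urls(base_url, total) if total else []
```

Returning None instead of raising lets the caller decide what to do when the pagination data is missing, which avoids the catch-all except.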
Upvotes: 1
Reputation: 10666
You might expect an XPath like this to return the next page URL:
next_page_url = response.xpath('//div[@class="ui2-pagination-pages"]/span[@class="current"]/following-sibling::a[1][contains(@href, "?page=")]/@href').extract_first()
However, this will not work, because the pagination block is rendered by JavaScript :-(
As a workaround, you can read the link element with rel="next" instead:
next_page_url = response.xpath('//link[@rel="next"]/@href').extract_first()
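Whatever href comes back should probably be checked and made absolute before it is requested. The sketch below shows one way to do that with the standard library; resolve_next_page is a hypothetical helper, and the protocol-relative and query-only href values are assumptions about what the page might contain, not observed output.

```python
from urllib.parse import urljoin


def resolve_next_page(current_url, href):
    """Return an absolute URL for href, or None when there is no usable link."""
    # Skip missing values and pseudo-links such as javascript:void(0).
    if not href or href.lower().startswith("javascript:"):
        return None
    # urljoin handles absolute, protocol-relative and relative hrefs alike.
    return urljoin(current_url, href)


current = "https://www.alibaba.com/catalog/agricultural-growing-media_cid144?page=1"

# Assumed examples of href values, for illustration only.
from_protocol_relative = resolve_next_page(
    current, "//www.alibaba.com/catalog/agricultural-growing-media_cid144?page=2"
)
from_query_only = resolve_next_page(current, "?page=2")
from_js_link = resolve_next_page(current, "javascript:void(0)")
```

Inside a spider, response.urljoin(next_page_url) or response.follow(next_page_url, callback=self.parse) should give the same resolution without the extra helper.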
Upvotes: 2