Struggling with Scrapy pagination

Question

At the moment I have a bit of Frankenstein code (part BeautifulSoup, part Scrapy) that seems to do the job of reading the info from the page 1 URLs. I will redo everything in Scrapy as soon as the pagination issue is resolved.

What the code is meant to do:

1. Read all subcategories (BeautifulSoup part).

The rest are Scrapy code parts:

2. Using the above URLs, read the sub-subcategories.
3. Extract the last page number and loop over the above URLs.
4. Extract the necessary product info from the above URLs.

Everything except part 3 seems to work.

I have tried the code below to extract the last page number, but I am not sure how to integrate it into the main code:

def parse_paging(self, response):
    try:
        for next_page in ('?pn=1' + response.xpath('//ul[@class="pagination pull-left"]/noscript/a/text()').extract()[-1]):
            print(next_page)
            # yield scrapy.Request(url=response.urljoin(next_page))
    except:
        pass
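
Note that the loop in parse_paging iterates over the characters of one concatenated string rather than over page numbers. A minimal sketch of the intended step, using only the standard library: take the last page number from the pagination link texts and build one URL per page. It assumes that the `pn` query parameter selects the page (as the `?pn=1` string suggests); the helper names `last_page_number` and `page_urls` and the example.com URL are made up for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def last_page_number(link_texts):
    """Return the largest page number found in the link texts, or 1 if none is numeric."""
    numbers = [int(t.strip()) for t in link_texts if t.strip().isdecimal()]
    return max(numbers, default=1)


def page_urls(base_url, last_page, param="pn"):
    """Build one URL per page by setting the (assumed) page parameter on base_url."""
    parts = urlsplit(base_url)
    # Keep any other query parameters, drop an existing page parameter.
    other = [(k, v) for k, v in parse_qsl(parts.query) if k != param]
    urls = []
    for page in range(1, last_page + 1):
        query = urlencode(other + [(param, page)])
        urls.append(urlunsplit(parts._replace(query=query)))
    return urls


if __name__ == "__main__":
    # Hypothetical link texts, as they might come back from the XPath above.
    last = last_page_number(["1", "2", "3"])
    for url in page_urls("https://example.com/web/c/cables/", last):
        print(url)
```

Keeping the URL building separate from the spider makes it possible to check the page logic without running a crawl.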

The below is the main code.

import requests
from bs4 import BeautifulSoup
import pandas as pd
import scrapy
from scrapy.crawler import CrawlerProcess

category_list = []
sub_category_url = []

root_url = 'https://uk.rs-online.com/web'
page = requests.get(root_url)
soup = BeautifulSoup(page.content, 'html.parser')
cat_up = [a.find_all('a') for a in soup.find_all('div',class_='horizontalMenu sectionUp')]
category_up = [item for sublist in cat_up for item in sublist]
cat_down = [a.find_all('a') for a in soup.find_all('div',class_='horizontalMenu sectionDown')]
category_down = [item for sublist in cat_down for item in sublist]
for c_up in category_up:
    sub_category_url.append('https://uk.rs-online.com' + c_up['href'])
for c_down in category_down:
    sub_category_url.append('https://uk.rs-online.com' + c_down['href'])
#   print(k)


class subcategories(scrapy.Spider):
    name = 'subcategories'

    def start_requests(self):
        urls = sub_category_url
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
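    def parse(self, response):
        # '::attr(href)' is Scrapy's CSS extension for reading an attribute value
        products = response.css('div.card.js-title a::attr(href)').extract() #xpath("//div[contains(@class, 'js-tile')]/a/@href").
        for p in products:
            # response.urljoin resolves relative links against the page URL, so no extra import is needed
            url = response.urljoin(p)
            yield scrapy.Request(url, callback=self.parse_product)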
    def parse_product(self, response):
        for quote in response.css('tr.resultRow'):
            yield {
                'product': quote.css('div.row.margin-bottom a::text').getall(),
                'stock_no': quote.css('div.stock-no-label a::text').getall(),
                'brand': quote.css('div.row a::text').getall(),
                'price': quote.css('div.col-xs-12.price.text-left span::text').getall(),
                'uom': quote.css('div.col-xs-12.pack.text-left span::text').getall(),
            }
process = CrawlerProcess()
process.crawl(subcategories)
process.start()

I would be very grateful for any hints on how to deal with the above issue.

Let me know if you have any questions.
