Reputation: 1
At the moment I have a bit of Frankenstein code (a mix of BeautifulSoup and Scrapy parts) that seems to do the job of reading the info from the page 1 URLs. I will try to redo everything in Scrapy as soon as the pagination issue is resolved.
What the code is meant to do:
1. Collect the category URLs from the home page (the BeautifulSoup part). The rest are Scrapy parts.
2. Using the above URLs, read the sub-subcategories.
3. Extract the last page number and loop over the above URLs.
Everything except part 3 seems to work.
I have tried the code below to extract the last page number, but I am not sure how to integrate it into the main code.
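def parse_paging(self, response):
    # The text of the last pagination link is assumed to be the last page number.
    pages = response.xpath('//ul[@class="pagination pull-left"]/noscript/a/text()').extract()
    try:
        last_page = int(pages[-1])
    except (IndexError, ValueError):
        return  # no pagination links, or the last link is not a number
    # Page 1 is the page already being parsed, so start from page 2.
    for page_number in range(2, last_page + 1):
        next_page = '?pn={}'.format(page_number)
        print(next_page)
        # yield scrapy.Request(url=response.urljoin(next_page))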
The main code is below.
import requests
from bs4 import BeautifulSoup
import pandas as pd
import scrapy
from scrapy.crawler import CrawlerProcess
category_list = []
sub_category_url = []
root_url = 'https://uk.rs-online.com/web'
page = requests.get(root_url)
soup = BeautifulSoup(page.content, 'html.parser')
cat_up = [a.find_all('a') for a in soup.find_all('div',class_='horizontalMenu sectionUp')]
category_up = [item for sublist in cat_up for item in sublist]
cat_down = [a.find_all('a') for a in soup.find_all('div',class_='horizontalMenu sectionDown')]
category_down = [item for sublist in cat_down for item in sublist]
for c_up in category_up:
    sub_category_url.append('https://uk.rs-online.com' + c_up['href'])
for c_down in category_down:
    sub_category_url.append('https://uk.rs-online.com' + c_down['href'])
class subcategories(scrapy.Spider):
    name = 'subcategories'

    def start_requests(self):
        # Start from the category URLs collected with BeautifulSoup above.
        for url in sub_category_url:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # '::attr(href)' selects the attribute value; '::href' is not a valid pseudo-element.
        products = response.css('div.card.js-title a::attr(href)').extract()  # xpath("//div[contains(@class, 'js-tile')]/a/@href")
        for p in products:
            # response.urljoin resolves relative links against the current page URL.
            yield scrapy.Request(response.urljoin(p), callback=self.parse_product)

    def parse_product(self, response):
        for quote in response.css('tr.resultRow'):
            yield {
                'product': quote.css('div.row.margin-bottom a::text').getall(),
                'stock_no': quote.css('div.stock-no-label a::text').getall(),
                'brand': quote.css('div.row a::text').getall(),
                'price': quote.css('div.col-xs-12.price.text-left span::text').getall(),
                'uom': quote.css('div.col-xs-12.pack.text-left span::text').getall(),
            }

process = CrawlerProcess()
process.crawl(subcategories)
process.start()
I would be exceptionally grateful for any hints on how to deal with the above issue.
Let me know if you have any questions.
Upvotes: 0
Views: 73
Reputation: 1445
I would suggest extracting the next page number with the selector below and then constructing the next page URL from that number.
next_page_number = response.css('.nextPage::attr(ng-click)').re_first(r'\d+')
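As a rough sketch of the "construct the next page URL" step, the helpers below use only the standard library. The names `first_number` and `page_url` are hypothetical, the `pn` query parameter is taken from the question's own attempt, and the `goToPage(4)` string in the usage note is only an assumed shape for the `ng-click` value; the real attribute may look different.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def first_number(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r'\d+', text or '')
    return int(match.group()) if match else None


def page_url(url, page_number, param='pn'):
    """Return url with the page query parameter (assumed to be 'pn') set to page_number."""
    parts = urlsplit(url)
    # Drop any existing page parameter so it is not duplicated.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != param]
    query.append((param, str(page_number)))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Inside a Scrapy callback this might be used as follows (sketch, not executed here):
#     raw = response.css('.nextPage::attr(ng-click)').get()
#     number = first_number(raw)
#     if number is not None:
#         yield scrapy.Request(page_url(response.url, number), callback=self.parse_product)
```

For example, `page_url('https://example.com/web/c/items/', first_number('goToPage(4)'))` gives `https://example.com/web/c/items/?pn=4`. Requesting the next page from the same callback that parses the current one means the spider keeps following pages until no next-page element is found, so the last page number never has to be known in advance.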
Upvotes: 1