Reputation: 1
I am attempting to scrape a catalogue of products from a website using **Scrapy**, clicking through the category pages and visiting each product page individually. 95% of the pages load correctly, but the remaining 5% do not load correctly on the first attempt, so I am unable to select the required data.
These pages load correctly when I scrape them individually, and on each new run a different set of pages fails to load. My working theory is that I just need to use the retry middleware to reload these pages.
I have tried to integrate this into my spider following the documentation here: https://docs.scrapy.org/en/latest/topics/downloader-middleware.html?highlight=retry#module-scrapy.downloadermiddlewares.retry, but I am struggling to get it to work.
```python
import logging
from datetime import datetime

import scrapy
from scrapy.utils.log import configure_logging
from scrapy.downloadermiddlewares.retry import get_retry_request

from Website.items import WebsiteItem

logger = logging.getLogger()

current_datetime = datetime.now().strftime("%Y-%m-%d %H-%M-%S")
str_current_datetime = str(current_datetime)


class WebsiteSpider(scrapy.Spider):
    configure_logging(install_root_handler=False)
    logging.basicConfig(
        filename="Websitelog" + str_current_datetime + ".txt",
        format='%(levelname)s: %(message)s',
        level=logging.INFO
    )

    name = "Webspider"

    def start_requests(self):
        yield scrapy.Request('https://www.website.com')

    async def parse(self, response):
        items = response.css('li.gridItem')
        for item in items:
            item_url = item.css('h3 a').attrib['href']  # identifies each item on the page
            yield response.follow(item_url, callback=self.parse_item_page)  # loads each item

        next_page = response.css('li.next a').attrib['href']
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)  # loads the next page of products

    async def parse_item_page(self, response):  # parse each product
        item_item = WebsiteItem()
        item_item['url'] = response.url
        item_item['title'] = response.css('.pd__header ::text').get()
        ...
        item_item['script_price'] = response.xpath('//head/script[contains(text(),"price")]/text()').re_first(r'"price": "[0-9]*\.[0-9]+"').split('"')[3]
        yield item_item
```
I have `RETRY_ENABLED = True` in my settings, as well as `'scrapy.downloadermiddlewares.retry.RetryMiddleware': 800` in my `DOWNLOADER_MIDDLEWARES`.
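Concretely, the retry-related part of my `settings.py` looks like this (any other settings are omitted):

```python
# Retry middleware configuration as described above.
RETRY_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    # 800 is the priority value I used for the retry middleware.
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 800,
}
```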
The error I get on the missing items relates to the `script_price` field: `AttributeError: 'NoneType' object has no attribute 'split'`. However, I am not sure whether that is relevant, as I believe all fields are blank on these items because the page has not loaded.
How can I integrate the retry function into my code (or something else) to ensure these remaining products are loaded and scraped correctly? Maybe using the `wait_for` setting? (I am unsure how to use it, as I would need to wait for different elements depending on whether a product page or a category page is being loaded.)
Upvotes: 0
Views: 58