Reputation: 1
I am attempting to scrape a catalogue of products from a website using **Scrapy**, clicking through the category pages and visiting each product page individually. 95% of the pages load correctly, but the remaining 5% do not load correctly on the first attempt, so I am unable to select the required data.
These pages load correctly when I scrape them individually, and on each new run a different set of pages fails to load. My working theory is that I just need to use the retry middleware to reload these pages.
I have tried to integrate this into my spider following the documentation here: https://docs.scrapy.org/en/latest/topics/downloader-middleware.html?highlight=retry#module-scrapy.downloadermiddlewares.retry, but I am struggling to get it to work.
```python
import logging
from datetime import datetime

import scrapy
from scrapy.utils.log import configure_logging
from scrapy.downloadermiddlewares.retry import get_retry_request

from Website.items import WebsiteItem

logger = logging.getLogger()

current_datetime = datetime.now().strftime("%Y-%m-%d %H-%M-%S")
str_current_datetime = str(current_datetime)


class WebsiteSpider(scrapy.Spider):
    configure_logging(install_root_handler=False)
    logging.basicConfig(
        filename="Websitelog" + str_current_datetime + ".txt",
        format='%(levelname)s: %(message)s',
        level=logging.INFO
    )

    name = "Webspider"

    def start_requests(self):
        yield scrapy.Request('https://www.website.com')

    async def parse(self, response):
        items = response.css('li.gridItem')
        for item in items:
            item_url = item.css('h3 a').attrib['href']  # identifies each item on the page
            yield response.follow(item_url, callback=self.parse_item_page)  # loads each item

        next_page = response.css('li.next a').attrib['href']
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)  # loads the next page of products

    async def parse_item_page(self, response):  # parse each product
        item_item = WebsiteItem()
        item_item['url'] = response.url
        item_item['title'] = response.css('.pd__header ::text').get()
        ...
        item_item['script_price'] = response.xpath('//head/script[contains(text(),"price")]/text()').re_first(r'"price": "[0-9]*\.[0-9]+"').split('"')[3]
        yield item_item
```
I have `RETRY_ENABLED = True` in my settings, as well as `'scrapy.downloadermiddlewares.retry.RetryMiddleware': 800` in my `DOWNLOADER_MIDDLEWARES`.
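Concretely, the retry-related part of my `settings.py` looks like this (any other settings are omitted):

```python
# Retry middleware configuration as described above.
RETRY_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    # 800 is the priority value I used for the retry middleware.
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 800,
}
```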
The error I get on the missing items relates to the `script_price` field: `AttributeError: 'NoneType' object has no attribute 'split'`. However, I am not sure whether that is relevant, as I believe all fields are blank on these items because the page has not loaded.
How can I integrate the retry function into my code (or something else) to ensure these remaining products are loaded and scraped correctly? Maybe using the `wait_for` setting? (I am unsure how to use it, as I would need to wait for different elements depending on whether a product page or a category page is being loaded.)
Upvotes: 0
Views: 58