Square

Reputation: 1

Scrapy Python retry requests on incomplete pages with HTTP 200

I am attempting to scrape a catalogue of products from a website using **Scrapy** to click through the category pages and visit each product page individually. 95% of pages load correctly, but the remaining 5% do not load fully on the first attempt, so I am unable to select the required data.

The pages load correctly when I scrape them individually, and on each new run a different set of pages fails. So my working theory is that I just need to use the retry middleware to reload these pages.

I have tried to integrate this into my spider following the documentation here: https://docs.scrapy.org/en/latest/topics/downloader-middleware.html?highlight=retry#module-scrapy.downloadermiddlewares.retry However, I am struggling to get it to work.

import logging
from scrapy.utils.log import configure_logging
import scrapy

from datetime import datetime

from Website.items import WebsiteItem
from scrapy.downloadermiddlewares.retry import get_retry_request

logger = logging.getLogger()
current_datetime = datetime.now().strftime("%Y-%m-%d %H-%M-%S")
str_current_datetime = str(current_datetime)

class WebsiteSpider(scrapy.Spider):
    configure_logging(install_root_handler=False)
    logging.basicConfig(
        filename="Websitelog"+str_current_datetime+".txt",
        format='%(levelname)s: %(message)s',
        level=logging.INFO
    )
    name = "Webspider"

    def start_requests(self):
        yield scrapy.Request('https://www.website.com',)
        
    async def parse(self, response):
        items = response.css('li.gridItem')
        for item in items:
            item_url = item.css('h3 a').attrib['href']  # identifies each item on the page

            yield response.follow(item_url, callback=self.parse_item_page)  # loads each item page

        next_page = response.css('li.next a::attr(href)').get()  # returns None when there is no next page
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)  # loads next page of products

    async def parse_item_page(self, response): #parse each product
        item_item = WebsiteItem()
        item_item['url'] = response.url
        item_item['title'] = response.css('.pd__header ::text').get()
        ...
        item_item['script_price'] = response.xpath('//head/script[contains(text(),"price")]/text()').re_first(r'"price": "[0-9]*\.[0-9]+"').split('"')[3]
        
        yield item_item

I have `RETRY_ENABLED = True` in settings, as well as `'scrapy.downloadermiddlewares.retry.RetryMiddleware': 800` in my `DOWNLOADER_MIDDLEWARES`.

The error I get on the missing items relates to the `script_price` field: `AttributeError: 'NoneType' object has no attribute 'split'`. However, I am not sure that is relevant, as I believe all fields are blank on these items because the page has not loaded.

How can I integrate the retry function into my code (or something else) to ensure these remaining products are loaded and scraped correctly? Maybe using the `wait_for` setting? (I am unsure how to use this, as I would need to wait for different elements depending on whether the individual product page or the category page is being loaded.)
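For reference, the extraction that crashes could be isolated into a helper that returns `None` instead of raising, so the callback can detect the incomplete page and decide to retry. A sketch (`extract_price` is a hypothetical name, not from my spider; it captures the number directly rather than splitting on quotes like my original `re_first(...).split('"')[3]` chain):

```python
import re

def extract_price(script_text):
    """Return the price string from an inline <script> body, or None
    when the script or the price is missing (incomplete page)."""
    if not script_text:
        return None
    match = re.search(r'"price": "([0-9]*\.[0-9]+)"', script_text)
    return match.group(1) if match else None
```

The callback would then check for `None` and yield a retry request instead of hitting the `AttributeError`.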

Upvotes: 0

Views: 58

Answers (0)
