Scrapy/Selenium: Send more than 3 failed requests before script stops

Question

I am currently trying to crawl a website (around 500 subpages).

The script is working quite smoothly. However, after 3 to 4 hours of running, I am sometimes getting the error message which you can find bellow. I think it is not the script which makes the problems, it is the website server which is quite slowly.

My question is: Is it somehow possible to send more than 3 "failed requests" before the script automatically stops/ closes the spider?

2019-09-27 10:53:46 [scrapy.extensions.logstats] INFO: Crawled 448 pages (at 1 pages/min), scraped 4480 items (at 10 items/min)
2019-09-27 10:54:00 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying  (failed 1 times): 504 Gateway Time-out
2019-09-27 10:54:46 [scrapy.extensions.logstats] INFO: Crawled 448 pages (at 0 pages/min), scraped 4480 items (at 0 items/min)
2019-09-27 10:55:00 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying  (failed 2 times): 504 Gateway Time-out
2019-09-27 10:55:46 [scrapy.extensions.logstats] INFO: Crawled 448 pages (at 0 pages/min), scraped 4480 items (at 0 items/min)
2019-09-27 10:56:00 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying  (failed 3 times): 504 Gateway Time-out
2019-09-27 10:56:00 [scrapy.core.engine] DEBUG: Crawled (504)  (referer: https://blogabet.com/tipsters) ['partial']
2019-09-27 10:56:00 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <504 https://blogabet.com/tipsters/?f[language]=all&f[pickType]=all&f[sport]=all&f[sportPercent]=&f[leagues]=all&f[picksOver]=0&f[lastActive]=12&f[bookiesUsed]=null&f[bookiePercent]=&f[order]=followers&f[start]=4480>: HTTP status code is not handled or not allowed
2019-09-27 10:56:00 [scrapy.core.engine] INFO: Closing spider (finished)

UPDATED CODE ADDED

class AlltipsSpider(Spider):
    name = 'alltips'
    allowed_domains = ['blogabet.com']
    start_urls = ('https://blogabet.com',)

    def parse(self, response):
        self.driver = webdriver.Chrome('C:\webdrivers\chromedriver.exe') 
        with open("urls.txt", "rt") as f:
            start_urls = [url.strip() for url in f.readlines()]
            for url in start_urls:
                self.driver.get(url)
                self.driver.find_element_by_id('currentTab').click()
                sleep(3)
                self.logger.info('Sleeping for 5 sec.')
                self.driver.find_element_by_xpath('//*[@id="_blog-menu"]/div[2]/div/div[2]/a[3]').click()
                sleep(7)
                self.logger.info('Sleeping for 7 sec.')

                while True:                 
                    try:
                        element = self.driver.find_element_by_id('last_item')
                        self.driver.execute_script("arguments[0].scrollIntoView(0, document.documentElement.scrollHeight-5);", element)
                        sleep(3)
                        self.driver.find_element_by_id('last_item').click()
                        sleep(7)


                    except NoSuchElementException:
                        self.logger.info('No more tipps')

                        sel = Selector(text=self.driver.page_source)

                        allposts = sel.xpath('//*[@class="block media _feedPick feed-pick"]')

                        for post in allposts:
                            username = post.xpath('.//div[@class="col-sm-7 col-lg-6 no-padding"]/a/@title').extract()
                            publish_date = post.xpath('.//*[@class="bet-age text-muted"]/text()').extract()



                            yield {'Username': username,
                                'Publish date': publish_date
                        self.driver.quit()
                        break

stranac · Accepted Answer

You can do this by simply changing the RETRY_TIMES setting to a higher number.

You can read about your retry-related options in the RetryMiddleware docs: https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#std:setting-RETRY_TIMES

Scrapy/Selenium: Send more than 3 failed requests before script stops

Answers (1)

Related Questions