MITHU

Reputation: 164

Unable to force a script to retry five times unless there is a 200 status in between

I've created a script using Scrapy which is capable of retrying some links from a list recursively, even when those links are invalid and get a 404 response. I used dont_filter=True and 'handle_httpstatus_list': [404] within meta to achieve the current behavior. What I'm trying to do now is make the script do the same at most 5 times, unless a 200 status turns up in between. I've included "max_retry_times": 5 within meta, expecting it to cap the retries at five, but it just retries infinitely.

What I've tried so far:

import scrapy
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess

class StackoverflowSpider(scrapy.Spider):
    name = "stackoverflow"
    # both urls are deliberately invalid (the working path would be /questions/tagged/web-scraping),
    # so every request comes back with a 404
    start_urls = [
        "https://stackoverflow.com/questions/taggedweb-scraping",
        "https://stackoverflow.com/questions/taggedweb-scraping"
    ]

    def start_requests(self):
        for start_url in self.start_urls:
            yield scrapy.Request(
                start_url,
                callback=self.parse,
                meta={"start_url": start_url, 'handle_httpstatus_list': [404], "max_retry_times": 5},
                dont_filter=True
            )

    def parse(self, response):
        start_url = response.meta.get("start_url")

        soup = BeautifulSoup(response.text, 'lxml')
        if soup.select(".summary .question-hyperlink"):
            for item in soup.select(".summary .question-hyperlink"):
                title_link = response.urljoin(item.get("href"))
                print(title_link)
        else:
            print("++++++++++"*20)  # to be sure about the recursion
            # re-yield the same request when no results were found (i.e. on a 404)
            yield scrapy.Request(
                start_url,
                meta={"start_url": start_url, 'handle_httpstatus_list': [404], "max_retry_times": 5},
                dont_filter=True,
                callback=self.parse
            )

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(StackoverflowSpider)
    c.start()

How can I let the script keep retrying at most five times?

Note: there are multiple URLs in the list which are identical. I don't wish to filter out the duplicate links; I would like Scrapy to use all of the URLs.

Upvotes: 3

Views: 859

Answers (3)

jis0324

Reputation: 207

import scrapy
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess

class StackoverflowSpider(scrapy.Spider):
    name = "stackoverflow"
    start_urls = [
        "https://stackoverflow.com/questions/taggedweb-scraping",
        "https://stackoverflow.com/questions/taggedweb-scraping"
    ]

    def start_requests(self):
        for start_url in self.start_urls:
            # carry a custom counter in meta and bump it on every manual retry
            yield scrapy.Request(
                start_url,
                callback=self.parse,
                meta={"request_count": 0, 'handle_httpstatus_list': [404], "max_retry_times": 5},
                dont_filter=True
            )

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        if soup.select(".summary .question-hyperlink"):
            for item in soup.select(".summary .question-hyperlink"):
                title_link = response.urljoin(item.get("href"))
                print(title_link)
        else:
            # only re-yield the request while the counter is below the limit
            request_count = response.meta.get("request_count")
            max_retry_times = response.meta.get("max_retry_times")
            if request_count < max_retry_times:
                start_url = response.url
                request_count += 1
                yield scrapy.Request(
                    start_url,
                    meta={"request_count": request_count, 'handle_httpstatus_list': [404], "max_retry_times": max_retry_times},
                    dont_filter=True,
                    callback=self.parse
                )

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(StackoverflowSpider)
    c.start()

I think this should work. Please let me know if I've made any mistakes.

Regards!

Upvotes: 1

Georgiy

Reputation: 3561

I can propose the following directions:

1. Add the 404 code to the RETRY_HTTP_CODES setting, as it doesn't include response code 404 by default. This is the built-in way to make the spider "capable of retrying a link recursively even when the link is invalid and gets a 404 response":

class StackoverflowSpider(scrapy.Spider):
    name = "stackoverflow"
    custom_settings = {
        'RETRY_HTTP_CODES': [500, 502, 503, 504, 522, 524, 408, 429, 404],
        'RETRY_TIMES': 5  # usage of the "max_retry_times" meta key is also valid
    }
....

2. With dont_filter=True, the Scrapy application will revisit previously visited pages, so removing dont_filter=True from your code should solve the infinite loop ("it just retries infinitely").
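
A rough, untested sketch of how the two directions could fit together: it relies on the built-in RetryMiddleware instead of re-yielding requests in the callback, uses Scrapy's own selectors instead of BeautifulSoup, and keeps dont_filter=True only on the initial requests so that both identical start URLs from the question are still crawled.

import scrapy
from scrapy.crawler import CrawlerProcess

class StackoverflowSpider(scrapy.Spider):
    name = "stackoverflow"
    start_urls = [
        "https://stackoverflow.com/questions/taggedweb-scraping",
        "https://stackoverflow.com/questions/taggedweb-scraping"
    ]
    custom_settings = {
        # let the built-in RetryMiddleware treat 404 as a retryable status
        'RETRY_HTTP_CODES': [500, 502, 503, 504, 522, 524, 408, 429, 404],
        'RETRY_TIMES': 5,  # at most five retries per request
        'USER_AGENT': 'Mozilla/5.0',
    }

    def start_requests(self):
        for start_url in self.start_urls:
            # dont_filter=True only so the duplicate start urls are both scheduled;
            # retries issued by the middleware bypass the dupe filter on their own
            yield scrapy.Request(start_url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        # without handle_httpstatus_list, only successful responses reach this callback;
        # once the five retries are used up, the final 404 is dropped by HttpErrorMiddleware
        for href in response.css(".summary .question-hyperlink::attr(href)").getall():
            print(response.urljoin(href))

if __name__ == "__main__":
    c = CrawlerProcess()
    c.crawl(StackoverflowSpider)
    c.start()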

Upvotes: 5

Zachary Jones

Reputation: 29

I don't know anything about Scrapy, so I apologise if this solution won't work.

But for a simple counter I've found a while loop works well. It would look something like this:

x = 5
while x > 0:
    for start_url in self.start_urls:
        yield scrapy.Request(start_url, callback=self.parse, meta={"start_url": start_url, 'handle_httpstatus_list': [404], "max_retry_times": 5}, dont_filter=True)
    # 200 status check goes here, using an if statement that breaks out of the loop if true
    x -= 1

So you perform the action you want, then subtract 1 from your counter. The loop runs again and again, subtracting one each time, until the counter reaches zero, at which point the while loop ends and you stop looping through your code.

To add a check for a 200 status, you add an if check into your code (which I'm afraid I'm not sure how to do with Scrapy) and place it before x -= 1. If the 200 status check is true, a break statement exits the while loop. Alternatively, if you want to use the x counter later on in your code (say you are running through several different function checks), you could set x = 0 before the break statement instead. The general shape of the pattern is sketched below.
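
For what it's worth, here is a minimal sketch of that counter-and-break pattern outside of Scrapy, using plain urllib just to make the 200 check concrete (the URL is the one from the question; check_status is a helper of my own for this sketch):

from urllib.request import urlopen
from urllib.error import HTTPError

def check_status(url):
    # helper for this sketch only: return the HTTP status code
    # instead of letting a 404 raise an exception
    try:
        return urlopen(url).status
    except HTTPError as err:
        return err.code

x = 5
while x > 0:
    status = check_status("https://stackoverflow.com/questions/taggedweb-scraping")
    if status == 200:
        break      # got a 200 in between, so stop retrying
    x -= 1         # otherwise use up one of the five attempts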

Again, I don't know anything about Scrapy yet, so I apologize if this is not a suitable solution.

Upvotes: -1
