Reputation: 164
I've created a script using Scrapy which is capable of retrying some links from a list recursively, even when those links are invalid and get a 404 response. I used dont_filter=True and 'handle_httpstatus_list': [404] within meta to achieve the current behavior. What I'm trying to do now is have the script do the same at most 5 times, unless a 200 status comes back in between. I've included "max_retry_times": 5 within meta, expecting it to keep retrying at most five times, but it just retries infinitely.
Here is what I've tried so far:
import scrapy
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess


class StackoverflowSpider(scrapy.Spider):
    name = "stackoverflow"

    start_urls = [
        "https://stackoverflow.com/questions/taggedweb-scraping",
        "https://stackoverflow.com/questions/taggedweb-scraping"
    ]

    def start_requests(self):
        for start_url in self.start_urls:
            yield scrapy.Request(start_url, callback=self.parse, meta={"start_url": start_url, 'handle_httpstatus_list': [404], "max_retry_times": 5}, dont_filter=True)

    def parse(self, response):
        if response.meta.get("start_url"):
            start_url = response.meta.get("start_url")

        soup = BeautifulSoup(response.text, 'lxml')
        if soup.select(".summary .question-hyperlink"):
            for item in soup.select(".summary .question-hyperlink"):
                title_link = response.urljoin(item.get("href"))
                print(title_link)
        else:
            print("++++++++++" * 20)  # to be sure about the recursion
            yield scrapy.Request(start_url, meta={"start_url": start_url, 'handle_httpstatus_list': [404], "max_retry_times": 5}, dont_filter=True, callback=self.parse)


if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(StackoverflowSpider)
    c.start()
How can I let the script keep retrying at most five times?
Note: there are multiple URLs in the list which are identical. I don't wish to filter out the duplicate links; I would like Scrapy to use all of the URLs.
Upvotes: 3
Views: 859
Reputation: 207
import scrapy
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess


class StackoverflowSpider(scrapy.Spider):
    name = "stackoverflow"

    start_urls = [
        "https://stackoverflow.com/questions/taggedweb-scraping",
        "https://stackoverflow.com/questions/taggedweb-scraping"
    ]

    def start_requests(self):
        for start_url in self.start_urls:
            yield scrapy.Request(start_url, callback=self.parse, meta={"request_count": 0, 'handle_httpstatus_list': [404], "max_retry_times": 5}, dont_filter=True)

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        if soup.select(".summary .question-hyperlink"):
            for item in soup.select(".summary .question-hyperlink"):
                title_link = response.urljoin(item.get("href"))
                print(title_link)
        else:
            request_count = response.meta.get("request_count")
            max_retry_times = response.meta.get("max_retry_times")
            if request_count < max_retry_times:
                start_url = response.url
                request_count += 1
                yield scrapy.Request(start_url, meta={"request_count": request_count, 'handle_httpstatus_list': [404], "max_retry_times": max_retry_times}, dont_filter=True, callback=self.parse)


if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(StackoverflowSpider)
    c.start()
I think this should work. Please let me know if I've made any mistakes.
Regards!
Upvotes: 1
Reputation: 3561
I can propose the following directions:
1. Add the 404 code to the RETRY_HTTP_CODES setting, as it doesn't include response code 404 by default. With that, the built-in retry middleware "is capable of retrying a link recursively even when the link is invalid and gets a 404 response":

    class StackoverflowSpider(scrapy.Spider):
        name = "stackoverflow"
        custom_settings = {
            'RETRY_HTTP_CODES': [500, 502, 503, 504, 522, 524, 408, 429, 404],
            'RETRY_TIMES': 5  # usage of the "max_retry_times" meta key is also valid
        }
        ....

2. Remove dont_filter=True - with it, the Scrapy application will revisit previously visited pages. Removing dont_filter=True from your code should solve the infinite loop issue ("but it just retries infinitely"). A combined sketch of both directions follows below.
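Putting both directions together, a minimal sketch of the full spider might look like the one below. It mirrors the question's URLs and parsing, but leaves the retrying to the built-in RetryMiddleware, so treat it as an illustration of the settings above rather than a drop-in replacement; dont_filter=True stays on the initial requests only so that both identical start URLs are scheduled (per the note in the question), and since parse() no longer re-yields the request, the infinite loop addressed in point 2 can't occur here.

import scrapy
from scrapy.crawler import CrawlerProcess


class StackoverflowSpider(scrapy.Spider):
    name = "stackoverflow"

    # 404 is appended to the default retry codes so RetryMiddleware
    # re-requests it; RETRY_TIMES caps the attempts at 5.
    custom_settings = {
        'RETRY_HTTP_CODES': [500, 502, 503, 504, 522, 524, 408, 429, 404],
        'RETRY_TIMES': 5,
    }

    start_urls = [
        "https://stackoverflow.com/questions/taggedweb-scraping",
        "https://stackoverflow.com/questions/taggedweb-scraping"
    ]

    def start_requests(self):
        for start_url in self.start_urls:
            # dont_filter=True only keeps the duplicate start URLs;
            # retries are handled by the downloader middleware.
            yield scrapy.Request(start_url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        # Without 404 in handle_httpstatus_list, only successful responses
        # reach this callback, so no manual re-yielding is needed.
        for href in response.css(".summary .question-hyperlink::attr(href)").getall():
            print(response.urljoin(href))


if __name__ == "__main__":
    c = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    c.crawl(StackoverflowSpider)
    c.start()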
Upvotes: 5
Reputation: 29
I don't know anything about Scrapy, so I apologise if this solution doesn't work. But for a simple counter I've found that a while loop works well. It would look something like this:
x = 5
while not x == 0:
    for start_url in self.start_urls:
        yield scrapy.Request(start_url, callback=self.parse, meta={"start_url": start_url, 'handle_httpstatus_list': [404], "max_retry_times": 5}, dont_filter=True)
    # 200 status check goes here, using an if statement that results in a break statement if true
    x -= 1
So you perform the action that you wish to perform, then subtract 1 from your counter. The action loops through again and again, and each pass subtracts one from the counter. This repeats until the counter reaches zero, at which point the while loop exits and you stop looping through your code.
To add a check for a 200 status, you add an if check into your code (which I'm afraid I'm not sure how to do with Scrapy) and place it before x -= 1. If the 200 status check is True, you add a break statement to exit the while loop. Alternatively, if you want to use the x counter later on in your code (say you are running through several different function checks), you could set x = 0 before your break statement.
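To make the idea concrete, here is a small self-contained sketch of the counter-and-break pattern. fetch_status() is just a stand-in I made up to simulate responses; it is not part of Scrapy.

# Toy version of the counter-and-break pattern: retry up to 5 times,
# stop early as soon as a 200 comes back.
responses = iter([404, 404, 200, 404])  # simulated server replies

def fetch_status():
    # Stand-in for whatever actually performs the request.
    return next(responses, 404)

x = 5
while not x == 0:
    status = fetch_status()
    print("attempt made, got status", status)
    if status == 200:  # success: stop retrying
        x = 0          # keeps the counter usable later on, as described above
        break
    x -= 1             # failure: one fewer retry left

print("retries remaining:", x)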
Again, I don't know anything about scrapy yet, so I apologize if this is not a suitable solution.
Upvotes: -1