Reputation: 33
I use scrapoxy, which implements IP rotation while scraping.
I have a list, BLACKLIST_HTTP_STATUS_CODES,
of status codes that indicate the current IP is blocked.
The problem: once a response's status code is in BLACKLIST_HTTP_STATUS_CODES,
the scrapoxy downloader middleware raises IgnoreRequest and then changes the IP. As a result, my script skips the URL whose response had the bad status code.
Example of logs:
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.some-website.com/profile/190> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.some-website.com/profile/191> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.some-website.com/profile/192> (referer: None)
[spider] DEBUG: Ignoring Blacklisted response https://www.some-website.com/profile/193: HTTP status 429
[urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): 13.33.33.37:8889
[urllib3.connectionpool] DEBUG: http://13.33.33.37:8889 "POST /api/instances/stop HTTP/1.1" 200 11
[spider] DEBUG: Remove: instance removed (1 instances remaining)
[spider] INFO: Sleeping 89 seconds
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.some-website.com/profile/194> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.some-website.com/profile/195> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.some-website.com/profile/196> (referer: None)
As a result, my script skipped https://www.some-website.com/profile/193.
The goal: I want to retry any request whose response status code is in BLACKLIST_HTTP_STATUS_CODES
until the response status is no longer in that list.
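As an aside, if the blocked status codes overlap with ordinary retryable ones, Scrapy's built-in RetryMiddleware can already do part of this through settings alone. The values below are illustrative, not taken from the question, and this only helps if the scrapoxy middleware doesn't raise IgnoreRequest before RetryMiddleware sees the response:

```python
# settings.py -- illustrative values, not from the question
RETRY_ENABLED = True
RETRY_HTTP_CODES = [429, 503]   # treat these statuses as "blocked, try again"
RETRY_TIMES = 5                 # maximum extra attempts per request
```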
My DownloaderMiddleware looks like this:
class BlacklistDownloaderMiddleware(object):
    def __init__(self, crawler):
        ...

    @classmethod
    def from_crawler(cls, crawler):
        ...

    def process_response(self, request, response, spider):
        """
        Detect a blacklisted response and stop the instance if necessary.
        """
        try:
            # self._http_status_codes is actually BLACKLIST_HTTP_STATUS_CODES
            if response.status in self._http_status_codes:
                # I have defined BlacklistError myself
                raise BlacklistError(response, 'HTTP status {}'.format(response.status))
            return response
        # THIS IS HOW THE ORIGINAL CODE LOOKS
        except BlacklistError as ex:
            # Some logs
            spider.log('Ignoring Blacklisted response {0}: {1}'.format(
                response.url, str(ex)), level=logging.DEBUG)
            # Get the name of the proxy that I need to change
            name = response.headers['x-cache-proxyname'].decode('utf-8')
            # Change the proxy
            self._stop_and_sleep(spider, name)
            # Drop the URL
            raise IgnoreRequest()
            # MY TRY: I tried this instead of raising IgnoreRequest, but
            # it does not work and asks for the arguments spider and
            # response for self.process_response:
            # return Request(response.url, callback=self.process_response, dont_filter=True)
Upvotes: 0
Views: 1639
Reputation: 2244
Instead of building a new Request object, you should copy the original request, e.g. retry = request.copy(). You can check out how Scrapy's RetryMiddleware handles retries.
For your reference:
def _retry(self, request):
    ...
    retryreq = request.copy()
    retryreq.dont_filter = True
    ...
    return retryreq
And you could call it like this (no try/except is needed here, since nothing raises):

def process_response(self, request, response, spider):
    if response.status in self._http_status_codes:
        name = response.headers['x-cache-proxyname'].decode('utf-8')
        self._stop_and_sleep(spider, name)
        return self._retry(request)
    return response
This should give you the idea.
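Putting the pieces together, here is a self-contained sketch of the middleware. The retry cap via request.meta and the BLACKLIST_MAX_RETRIES setting are my own assumptions (without a cap, a permanently blocked URL would retry forever); the proxy-rotation call is kept as a comment because _stop_and_sleep comes from the question's code:

```python
import logging


class BlacklistDownloaderMiddleware(object):
    """Retry blacklisted responses instead of dropping them with IgnoreRequest."""

    def __init__(self, http_status_codes, max_retries=5):
        # BLACKLIST_HTTP_STATUS_CODES from the question
        self._http_status_codes = http_status_codes
        # Assumed cap so a permanently blocked URL cannot loop forever
        self._max_retries = max_retries

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            crawler.settings.getlist('BLACKLIST_HTTP_STATUS_CODES'),
            crawler.settings.getint('BLACKLIST_MAX_RETRIES', 5),
        )

    def _retry(self, request):
        # Copy the original request so its callback, meta and headers survive.
        retryreq = request.copy()
        retryreq.dont_filter = True  # bypass the dupefilter on re-download
        return retryreq

    def process_response(self, request, response, spider):
        if response.status in self._http_status_codes:
            retries = request.meta.get('blacklist_retry_times', 0)
            if retries < self._max_retries:
                logging.debug('Retrying blacklisted response %s (HTTP %s)',
                              response.url, response.status)
                # Rotate the proxy exactly as in the question, e.g.:
                # name = response.headers['x-cache-proxyname'].decode('utf-8')
                # self._stop_and_sleep(spider, name)
                retryreq = self._retry(request)
                retryreq.meta['blacklist_retry_times'] = retries + 1
                return retryreq
        # Either a good response or the retry budget is exhausted
        return response
```

Returning the copied request from process_response sends it back through the scheduler, so the same URL is downloaded again (through the new proxy) with its original callback intact.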
Upvotes: 2